Malware Analysis and Detection Using Machine Learning
Introduction
 Malware is a generic term for all kinds of computer threats. A simple
classification of malware distinguishes file infectors from stand-alone
malware. Malware can also be classified by its particular action: worms,
backdoors, trojans, rootkits, spyware, adware, etc.
 This project combines two fields: cyber security and machine learning.
It gives an overview of machine learning methods that have been proposed
for malware detection, such as Random Forest Classifier, Gradient Boosting,
and AdaBoost.
CONTENTS:-
1. Aim and Objectives
2. Dataset
3. Algorithms
4. Model building
5. Accuracy
6. Testing
7. Conclusion
Problem Statement:
Analyse the data and build a classifier model that distinguishes between
legitimate (clean) files and malicious (malware) files. We have used ensemble
techniques, namely bagging and boosting.
OBJECTIVES:
 To use machine learning algorithms to distinguish legitimate files
from malware files.
 To get the best accuracy on the testing data.
DATASET:-
Our project uses 138048 samples (files), of which 41323 are
legitimate (clean) files, most of them with the .exe or .dll
extension. The other 96724 files are taken from the
www.virusshare.com site and are treated as the malware
files.
 Train Dataset- 30%
 Test Dataset- 70%
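The class counts above are unbalanced (roughly 30% legitimate, 70% malware), so a split that keeps the same class share in both parts is one reasonable choice. The following is a minimal sketch of such a stratified 30/70 split in plain Python. The function name `stratified_split` and the toy labels are assumptions for illustration, not the project's actual code or data.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up): 0 = legitimate, 1 = malware
labels = [0] * 30 + [1] * 70
train_idx, test_idx = stratified_split(labels, train_fraction=0.30)
print(len(train_idx), len(test_idx))
```

Because each class is cut separately, the malware share of the training part matches the malware share of the whole toy set.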
MODEL BUILDING:-
 The model is built in the following stages:-
 Feature Selection
 Separate the data in X and Y
 Train test split
 fit the model
 Predict the data
 Find the maximum accuracy
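The stages above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data is synthetic (`make_classification`), `SelectKBest` with `f_classif` stands in for whatever feature selection the project used, and the hyperparameters are arbitrary. The selector is fitted on the training part only so that the test data does not influence which features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Separate the data into X (features) and y (labels); synthetic stand-in data
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

# Train/test split with the 30% / 70% ratio stated in the dataset section
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.70, stratify=y, random_state=0)

# Feature selection: keep 14 features (the selection method is an assumption)
selector = SelectKBest(score_func=f_classif, k=14).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model, predict on the test part, and record the accuracy
accuracies = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    predictions = model.predict(X_test_sel)
    accuracies[name] = accuracy_score(y_test, predictions)

# Find the model with the maximum accuracy
best_model = max(accuracies, key=accuracies.get)
print(accuracies, best_model)
```

The accuracies printed here come from synthetic data and say nothing about the figures reported for the real dataset.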
Model Building
Exploratory Data Analysis:
Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check
assumptions with the help of summary statistics and graphical representations.
Correlation
Pearson Correlation :
Ranges from +1 to -1
+1 highest positive correlation
-1 highest negative correlation
0 no linear correlation between the two variables
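As a small worked example, the Pearson coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). The sketch below uses made-up numbers. The last case shows that a coefficient of 0 only rules out a linear relationship: there `y` is fully determined by `x`, yet the coefficient is 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


values = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = pearson(values, [2 * v + 1 for v in values])   # rising straight line
decreasing = pearson(values, [-3 * v for v in values])      # falling straight line
symmetric = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])     # y = x squared
print(increasing, decreasing, symmetric)
```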
Final 14 Features
Split the Data into X train and Y train
Algorithms :
Decision Tree:
The general motive of using a Decision Tree is to create a training model that can
predict the class or value of a target variable by learning decision rules inferred from
prior data (training data). Each internal node of the tree corresponds to an attribute,
and each leaf node corresponds to a class label.
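To make the node/leaf structure concrete, here is a hand-written toy tree in plain Python. The attribute names (`section_entropy`, `num_imports`), the thresholds, and the resulting rules are hypothetical and chosen only for illustration. They are not taken from the project's trained model and are not real detection heuristics.

```python
# Internal nodes test one attribute against a threshold; leaves hold a class label.
toy_tree = {
    "attribute": "section_entropy",
    "threshold": 7.0,
    "left": {                                   # taken when value < threshold
        "attribute": "num_imports",
        "threshold": 5,
        "left": {"label": "malware"},
        "right": {"label": "legitimate"},
    },
    "right": {"label": "malware"},              # taken when value >= threshold
}


def predict(node, sample):
    """Walk from the root to a leaf and return that leaf's class label."""
    while "label" not in node:
        go_left = sample[node["attribute"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(toy_tree, {"section_entropy": 5.2, "num_imports": 40}))
```

A learning algorithm would choose the attributes and thresholds from the training data instead of having them written by hand.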
Decision Tree
Construct a Decision Tree
Random Forest Classifier
Random forest is a supervised learning algorithm. The "forest" it builds is an
ensemble of decision trees, usually trained with the "bagging" method. The
general idea of the bagging method is that combining several learning
models improves the overall result.
Put simply: random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.
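A minimal sketch of this idea with scikit-learn, assuming synthetic data and arbitrary hyperparameters (25 trees, fixed seeds): a single decision tree and a bagged forest of trees are trained on the same split so their test accuracies can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 14 features (not the project's dataset)
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.70, stratify=y, random_state=1)

# One tree on its own
single_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# bootstrap=True: each tree is trained on a bootstrap sample (bagging)
forest = RandomForestClassifier(n_estimators=25, bootstrap=True,
                                random_state=1).fit(X_train, y_train)

tree_accuracy = single_tree.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print(f"single tree: {tree_accuracy:.3f}, forest of "
      f"{len(forest.estimators_)} trees: {forest_accuracy:.3f}")
```

On most datasets the forest is at least as accurate as a single tree, but that is not guaranteed for every split.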
Random Forest Model:
Confusion Matrix :
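A confusion matrix counts how often each true class was predicted as each class. The sketch below computes the four cells for a binary problem in plain Python and derives accuracy from them. The labels are made up (1 = malware, 0 = legitimate) and are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Made-up labels for ten files
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

cells = confusion_counts(y_true, y_pred)
accuracy = (cells["TP"] + cells["TN"]) / len(y_true)
print(cells, accuracy)
```

For malware detection, the FN cell (malware predicted as legitimate) and the FP cell (clean files flagged as malware) are usually worth reading separately, since accuracy alone hides which kind of error occurred.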
Gradient Boost:
Gradient boosting machines are a family of powerful machine-learning
techniques that have shown considerable success in a wide range of
practical applications. They are highly customizable to the particular
needs of the application, for example by being trained with respect to
different loss functions. A gradient boosting model is built sequentially:
each new weak learner, typically a shallow decision tree, is fitted to
correct the errors of the ensemble built so far.
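The mechanism can be sketched in a few lines of plain Python. This is a simplified illustration, not the classifier used in the project: it boosts one-split trees (stumps) on a single feature with squared-error loss, where the negative gradient is simply the residual. The function names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:        # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_rounds, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # Negative gradient of the squared-error loss = residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]                         # toy target
error_0 = mse(gradient_boost(xs, ys, n_rounds=0), xs, ys)
error_1 = mse(gradient_boost(xs, ys, n_rounds=1), xs, ys)
error_20 = mse(gradient_boost(xs, ys, n_rounds=20), xs, ys)
print(error_0, error_1, error_20)
```

Swapping the residual for the negative gradient of another loss (for example log loss for classification) is what makes the method customizable to different loss functions.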
Gradient Boosting :
ACCURACY:-
 After applying these algorithms, we obtained the following
accuracies:
 Decision Tree Model - 99.04%
 Random Forest Classifier - 99.35%
 Gradient Boosting Classifier - 98.62%
Bar Plot for the Accuracy :
CONCLUSION:-
 Machine learning models have great potential to outperform humans when
sufficient data is provided.
 Given input test data, the generated system produces the desired output.
 The system achieved 99.35% accuracy in classifying the files.
Future Scope :
• Machine learning models have great potential to outperform humans when
sufficient data is provided.
• The system achieved 99.35% accuracy in distinguishing legitimate files from
malware files.
• We can predict the amount of time the question will take to get answered.
THANK YOU
