Malware Detection using Machine Learning
By:
Shubham Dubey(14ucs114)
Malware overview
●Malicious software that tries to damage your system or gain unauthorized
access to it.
●Comes in different types:
Virus | Trojan | Adware | Worm, etc.
●More than 1 lakh (100,000) new samples are found by AV companies every day.
●Most of them are variants of each other or of older samples.
Current status of Detection
●Antivirus companies currently rely on signature-based detection.
●A signature can be anything from strings to assembly code
snippets.
Problem with current method
●Polymorphic malware can change its code on every execution.
●Most malware can encrypt or pack itself using packers.
●Detecting such malware with signatures does not always work.
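The weakness described above can be illustrated with a minimal sketch. It assumes a signature is a plain byte string and stands in for a packer with a single-byte XOR encoder; the payload bytes, the key and the function names are hypothetical and chosen only for illustration.

```python
def has_signature(data: bytes, signature: bytes) -> bool:
    """Naive signature check: is the byte pattern present in the file?"""
    return signature in data


def xor_encode(data: bytes, key: int) -> bytes:
    """Toy stand-in for a packer: XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in data)


# Hypothetical signature and sample bytes, for illustration only.
SIGNATURE = b"EVIL_PAYLOAD"
original = b"\x90\x90" + SIGNATURE + b"\xc3"
packed = xor_encode(original, 0x5A)

print(has_signature(original, SIGNATURE))    # signature found in the plain sample
print(has_signature(packed, SIGNATURE))      # signature no longer found after encoding
print(xor_encode(packed, 0x5A) == original)  # decoding restores the same bytes
```

The encoded sample still decodes to the same bytes at run time, so its behaviour is unchanged even though the byte pattern is gone.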
Solution using ML
●API call sequence features can be used to decide whether a file is
malicious.
●API calls are a robust basis for analysis because they cannot be altered easily.
●They outline everything happening to the operating system,
including operations on files, the registry, mutexes, processes
and other features.
●For example, OpenFile and CreateFile describe file operations, while
OpenMutex and CreateMutex describe mutexes opened or created.
Our System Description
●Cuckoo Sandbox is used to analyze samples and record all API calls.
●The report for each file is saved as a JSON file.
●The calls are parsed and saved to a CSV file as a matrix of feature vectors.
●Samples with fewer than 10 calls (or any other user-set value) are
ignored.
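The parsing and filtering steps above can be sketched as follows. The sketch assumes the report keeps calls under behavior → processes → calls, each with an "api" field, as Cuckoo JSON reports commonly do; the exact layout depends on the Cuckoo version, and the inline report and helper names here are hypothetical.

```python
import json
from typing import List

MIN_CALLS = 10  # user-set threshold; samples below it are ignored


def extract_api_calls(report: dict) -> List[str]:
    """Collect API names in call order from every monitored process."""
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                calls.append(api)
    return calls


def keep_sample(calls: List[str], min_calls: int = MIN_CALLS) -> bool:
    """Return True when the sample made enough calls to be analyzed."""
    return len(calls) >= min_calls


# Hypothetical, heavily trimmed report used instead of a real report file.
sample_json = json.dumps({
    "behavior": {
        "processes": [
            {"process_name": "sample.exe",
             "calls": [{"api": "NtCreateFile"},
                       {"api": "NtWriteFile"},
                       {"api": "NtClose"}]},
        ]
    }
})

report = json.loads(sample_json)
calls = extract_api_calls(report)
print(calls)
print(keep_sample(calls))  # only 3 calls, so this sample would be ignored
```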
Feature extraction
●The frequency representation approach has been taken.
●S1, S2, ... are sample numbers; API1, API2, ... are API calls, and each cell holds the number of calls a sample made to that API.
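A minimal sketch of the frequency representation, assuming each sample is already a list of API names. The sample data, the "sample" column header and the function names are illustrative assumptions, not the original implementation.

```python
import csv
import io
from collections import Counter


def frequency_matrix(samples: dict):
    """Build one row per sample with the call count for every API seen."""
    apis = sorted({api for calls in samples.values() for api in calls})
    rows = []
    for name, calls in samples.items():
        counts = Counter(calls)
        rows.append([name] + [counts[api] for api in apis])
    return apis, rows


def to_csv(apis, rows) -> str:
    """Serialize the matrix as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample"] + apis)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical call traces for two samples.
samples = {
    "S1": ["CreateFile", "WriteFile", "WriteFile", "CloseHandle"],
    "S2": ["OpenMutex", "CreateFile", "CloseHandle"],
}

apis, rows = frequency_matrix(samples)
print(to_csv(apis, rows))
```

Each row is the feature vector of one sample, so the CSV can be loaded directly by a classifier.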
Redundant subsequence removal methods
●A large number of useless API call sequences are present.
●They can be removed using N-gram subsequence
extraction.
●If an API call pattern is present in many samples,
remove it (works like a sliding window).
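One way to read the N-gram step is sketched below: slide a window of length n over each trace, count in how many samples each n-gram occurs, and drop the n-grams shared by more than a chosen fraction of samples. The threshold, the traces and the function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(calls, n):
    """Sliding window of length n over a call trace."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def common_ngrams(samples, n, max_fraction):
    """N-grams that occur in more than max_fraction of all samples."""
    doc_freq = Counter()
    for calls in samples:
        doc_freq.update(set(ngrams(calls, n)))  # count each sample once
    limit = max_fraction * len(samples)
    return {gram for gram, count in doc_freq.items() if count > limit}


def remove_ngrams(calls, redundant, n):
    """Drop every occurrence of a redundant n-gram from one trace."""
    kept = []
    i = 0
    while i < len(calls):
        if tuple(calls[i:i + n]) in redundant:
            i += n  # skip the whole redundant window
        else:
            kept.append(calls[i])
            i += 1
    return kept


# Hypothetical traces: every sample starts with the same loader calls.
samples = [
    ["LdrLoadDll", "LdrGetProcedureAddress", "CreateFile", "WriteFile"],
    ["LdrLoadDll", "LdrGetProcedureAddress", "OpenMutex"],
    ["LdrLoadDll", "LdrGetProcedureAddress", "RegOpenKey", "RegSetValue"],
]

redundant = common_ngrams(samples, n=2, max_fraction=0.6)
cleaned = [remove_ngrams(calls, redundant, 2) for calls in samples]
print(redundant)
print(cleaned)
```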
Redundant subsequence removal methods
●Another method is to use information gain.
●C denotes the class, and H(C) is the information entropy of the malware detection system.
●The information gain of the subsequence T to class C measures how much knowing T reduces H(C).
●p(ti) is the probability that the feature appears and p(tj) is the probability that it does not.
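The formulas on this slide did not survive extraction. The sketch below assumes the standard definitions: H(C) = -Σ p(c) log2 p(c), and IG(T) = H(C) - [p(ti) H(C|ti) + p(tj) H(C|tj)], where ti means the subsequence is present and tj that it is absent. The labels and feature vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(present, labels):
    """Entropy reduction from knowing whether a subsequence is present."""
    with_t = [label for p, label in zip(present, labels) if p]
    without_t = [label for p, label in zip(present, labels) if not p]
    p_ti = len(with_t) / len(labels)     # feature appears
    p_tj = len(without_t) / len(labels)  # feature does not appear
    conditional = p_ti * entropy(with_t) + p_tj * entropy(without_t)
    return entropy(labels) - conditional


# Hypothetical data: four samples, two classes.
labels = ["malware", "malware", "benign", "benign"]
perfect = [True, True, False, False]   # present only in malware
useless = [True, False, True, False]   # present equally in both classes

print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

Subsequences whose gain is close to zero carry little class information and are candidates for removal.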
Using machine learning methods
●After the features have been extracted and selected, machine learning
methods can be applied to the resulting data.
●The R packages used to implement the algorithms are:
Random Forest – randomForest
K-Nearest Neighbours – class
Support Vector Machines – kernlab
J48 Decision Tree – RWeka
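To show how a classifier consumes the frequency vectors, here is a minimal k-nearest-neighbours sketch in plain Python. It is not the R `class` package named above; the Euclidean distance, the training vectors and the labels are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=1):
    """Label a query vector by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical API-frequency vectors (one column per API call).
train_x = [[5, 0, 1], [4, 1, 0], [0, 3, 4], [1, 4, 5]]
train_y = ["benign", "benign", "malware", "malware"]

print(knn_predict(train_x, train_y, [0, 4, 4], k=1))
print(knn_predict(train_x, train_y, [5, 1, 1], k=1))
print(knn_predict(train_x, train_y, [0, 4, 4], k=3))
```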
Comparison method
●The Cuckoo analysis score indicates how malicious an
analyzed file is.
●There are three levels of severity, each with its own
score: 1 for low, 2 for medium and 3 for high.
●It is hard to measure the accuracy of this detection, since there is
no threshold value indicating whether a sample is malicious or
not.
●The score can be compared with the results produced by the ML algorithms.
Results
●The accuracy of detection is measured as the
percentage of correctly identified instances:
accuracy = (correctly classified instances / total instances) × 100%.
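A short sketch of this measure for both settings reported in the results, assuming the binary task is obtained by collapsing every malware family into a single "malware" class. The labels and predictions are hypothetical, not data from the experiments.

```python
def accuracy(predicted, actual):
    """Percentage of instances whose predicted label matches the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def to_binary(labels, benign_label="benign"):
    """Collapse all malware families into one 'malware' class."""
    return [benign_label if label == benign_label else "malware"
            for label in labels]


# Hypothetical ground truth and predictions.
actual = ["trojan", "worm", "benign", "adware", "benign"]
predicted = ["trojan", "trojan", "benign", "worm", "benign"]

print(accuracy(predicted, actual))                        # multi-class
print(accuracy(to_binary(predicted), to_binary(actual)))  # binary
```

Confusing one malware family with another counts as an error only in the multi-class setting, which is one reason binary accuracy is typically the higher of the two.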
Support Vector Machines Results
●The overall accuracy achieved was 87.6% for multi-class
classification and 94.6% for binary classification.
Random Forest Results
●The algorithm predicted with good accuracy:
95.69% for multi-class classification and
96.8% for binary classification.
KNN Results
●The best accuracy was achieved with k=1:
87% for multi-class classification and
94.6% for two-class classification.
Conclusion
Experiments show that the integrated machine learning
classifier performs better than separate
signature-based detection.
Conclusion
In classification problems, different models gave different
results. The lowest accuracy was achieved by Naive Bayes
(72.34% and 55%), followed by k-Nearest Neighbors
(87% multi-class, 94.6% binary) and Support Vector Machines
(87.6% multi-class, 94.6% binary). The highest accuracy was
achieved by the J48 and Random Forest models: 93.3% and
95.69% for multi-class classification, and 94.6% and 96.8%
for binary classification, respectively.
Thank You
