Malware identification using Machine Learning
By: Japneet Singh
Agenda
• Malware identification/detection process
• Need for automated malware analysis and detection
• Machine Learning
• Identifying malware using Supervised Learning
• Demo
• References
What is malware?
How do AV companies identify malware?
Stage 1 - In-House
• Manual/Automated analysis of digital artifacts
• Static analysis
• Dynamic analysis
• Signature generation and packaging
Stage 2 - On customer machines
• Classifying digital artifacts by matching against signatures
• Taking actions
What’s in a signature?
• Specific indicators like file path, name, size, network connections etc.
• Generic indicators like unique patterns in malware files
• Blacklisted file hashes
• Behavior indicators like sequences of malicious actions
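To make the hash and pattern indicators above concrete, here is a minimal sketch of signature matching. The signature values (`BLACKLISTED_SHA256`, `BYTE_PATTERNS`) and the function name are hypothetical stand-ins, not entries from any real AV product.

```python
import hashlib

# Hypothetical signature database; the values are illustrative only.
BLACKLISTED_SHA256 = {hashlib.sha256(b"known-bad-sample").hexdigest()}
BYTE_PATTERNS = [b"EVIL_MARKER"]  # stand-in for a pattern unique to a malware family


def matches_signature(data: bytes) -> bool:
    """Return True if the bytes hit a blacklisted hash or a generic byte pattern."""
    if hashlib.sha256(data).hexdigest() in BLACKLISTED_SHA256:
        return True  # exact match on a known-bad file
    return any(pattern in data for pattern in BYTE_PATTERNS)
```

A hash match only catches byte-identical files, while a pattern match can also catch variants that share the same byte sequence.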
Need for automated malware identification
• Malware detection has turned into a big data problem
• Most new malware is almost identical to existing malware
Machine Learning
Supervised learning
• Suitable for classification and regression problems
Classification vs Regression
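A small sketch of the difference in output type: regression predicts a continuous number, classification predicts one of a fixed set of labels. The two features, the weights, and the threshold below are made up for illustration.

```python
def risk_score(suspicious_imports: int, packed_sections: int) -> float:
    """Regression-style output: a continuous score (hypothetical weights)."""
    return 0.5 * suspicious_imports + 0.25 * packed_sections


def classify(suspicious_imports: int, packed_sections: int, threshold: float = 1.0) -> str:
    """Classification-style output: a discrete label chosen from a fixed set."""
    score = risk_score(suspicious_imports, packed_sections)
    return "malware" if score >= threshold else "benign"
```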
Classification process at a high level
Classification algorithms
• Logistic regression
• k-nearest neighbors
• Decision trees
• Random forests
Visualizing detection boundary: 2 features
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
Visualizing detection boundary: 3 features
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
Example datasets
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
Logistic Regression
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
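A minimal sketch of logistic regression trained with stochastic gradient descent, using only the standard library. The two-feature toy dataset, learning rate, and epoch count are assumptions chosen for illustration. This is not the book's implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x) -> float:
    """Probability that feature vector x belongs to class 1 (malware)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Toy data: two normalized features per file; label 1 = malware, 0 = benign.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
```

The learned weights define a straight-line decision boundary in the feature space, which is why logistic regression works best when the classes are roughly linearly separable.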
k-nearest neighbors
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
k-nearest neighbors
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
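A minimal sketch of k-nearest neighbors: a new sample takes the majority label of the k closest training samples. The toy training set and the choice of Euclidean distance are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label) pairs."""
    # Sort training samples by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with two features per file.
train = [
    ([1, 1], "benign"), ([2, 1], "benign"), ([1, 2], "benign"),
    ([8, 9], "malware"), ([9, 8], "malware"), ([8, 8], "malware"),
]
```

There is no training step. All the work happens at prediction time, so cost grows with the size of the training set.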
Decision Tree
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
Decision Tree
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
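A minimal sketch of a decision tree that picks each split by minimizing weighted Gini impurity. The feature meanings (column 0: "is compressed", column 1: "number of suspicious calls") and the toy data are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are label strings."""
    split = best_split(rows, labels)
    if depth >= max_depth or len(set(labels)) == 1 or split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )


def tree_predict(node, row):
    while isinstance(node, tuple):  # walk down until a leaf label is reached
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


rows = [[1, 8], [1, 7], [0, 9], [0, 1], [1, 0], [0, 2]]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
tree = build_tree(rows, labels)
```

On this toy data a single question ("more than 2 suspicious calls?") separates the classes, so the tree has one internal node.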
Random forest
Image source: Malware Data Science by Joshua Saxe and Hillary Sanders
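A minimal sketch of the random forest idea: train many trees, each on a bootstrap sample and a randomly chosen feature, then take a majority vote. To keep it short, each "tree" here is a one-split stump, which is a simplification. The toy data and parameters are assumptions.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """One-split tree on a single feature; returns (feature, threshold, left_label, right_label)."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= t]
        right = [l for r, l in zip(rows, labels) if r[feature] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(l != l_lab for l in left) + sum(l != r_lab for l in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample (with replacement)
        feature = rng.randrange(len(rows[0]))           # random feature for this tree
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = ["benign"] * 4 + ["malware"] * 4
forest = train_forest(rows, labels)
```

Because every tree sees different data and a different feature, individual mistakes tend to be outvoted by the rest of the forest.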
How AV companies identify malware: Part 2
Stage 1 - In-House
• Collect malware and benign samples
• Divide samples into train and test sets
• Train the ML model using the train set
  • Extract features from malware and benign files
  • Run the ML algorithm to train a model from the extracted features
• Test ML model efficacy using the test set
• If efficacy does not meet expectations, iterate
• If efficacy meets expectations, bundle the model with signatures
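The in-house steps above can be sketched end to end. Everything here is an assumption for illustration: the samples are synthetic (random bytes stand in for "malware", random lowercase text for "benign"), the two features are toy choices, and a nearest-centroid model replaces whatever algorithm a vendor would actually use.

```python
import math
import random
from collections import Counter


def extract_features(data: bytes) -> list:
    """Toy features: byte entropy and fraction of printable ASCII bytes."""
    n = len(data)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
    printable = sum(32 <= b < 127 for b in data) / n
    return [entropy, printable]


def train(features, labels) -> dict:
    """Nearest-centroid model: the mean feature vector of each class."""
    sums, counts = {}, Counter(labels)
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, vec) -> str:
    return min(model, key=lambda label: math.dist(model[label], vec))


def efficacy(model, features, labels) -> float:
    """Fraction of samples classified correctly (accuracy)."""
    hits = sum(classify(model, f) == l for f, l in zip(features, labels))
    return hits / len(labels)


# 1. Collect samples (synthetic stand-ins, not real malware).
rng = random.Random(42)
alphabet = b"abcdefghijklmnopqrstuvwxyz "
benign = [bytes(rng.choice(alphabet) for _ in range(256)) for _ in range(20)]
malware = [bytes(rng.randrange(256) for _ in range(256)) for _ in range(20)]
samples = [(s, "benign") for s in benign] + [(s, "malware") for s in malware]

# 2. Divide into train and test sets (80/20).
rng.shuffle(samples)
cut = int(0.8 * len(samples))
train_set, test_set = samples[:cut], samples[cut:]

# 3. Extract features and train on the train set only.
model = train([extract_features(s) for s, _ in train_set], [l for _, l in train_set])

# 4. Measure efficacy on the held-out test set.
score = efficacy(model, [extract_features(s) for s, _ in test_set], [l for _, l in test_set])
```

If `score` falls short of the target, the loop repeats with different features, more samples, or another algorithm. Measuring on held-out data is what keeps the efficacy number honest.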
Stage 2 - On customer machines
• Extract features from files
• Use the ML algorithm and the trained ML model to classify each file
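A minimal sketch of the customer-side step: load a shipped model, extract the same features from a file, and classify it. The JSON model format, the weights, the toy features, and the function names are all hypothetical. The demo files are written to a temporary directory.

```python
import json
import math
import tempfile
from collections import Counter
from pathlib import Path


def extract_features(data: bytes) -> list:
    """Toy features: byte entropy scaled to 0..1 and share of non-printable bytes."""
    if not data:
        return [0.0, 0.0]
    n = len(data)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
    non_printable = sum(not (32 <= b < 127) for b in data) / n
    return [entropy / 8.0, non_printable]


def load_model(path) -> dict:
    """The 'bundled' model is assumed to be logistic-regression weights stored as JSON."""
    return json.loads(Path(path).read_text())


def scan_file(model: dict, path):
    """Return (verdict, probability) for one file."""
    x = extract_features(Path(path).read_bytes())
    z = sum(w * v for w, v in zip(model["weights"], x)) + model["bias"]
    p = 1.0 / (1.0 + math.exp(-z))
    return ("malware" if p >= model["threshold"] else "benign"), p


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.json"
    model_path.write_text(json.dumps({"weights": [4.0, 4.0], "bias": -4.0, "threshold": 0.5}))
    text_file = Path(tmp) / "notes.txt"
    text_file.write_bytes(b"hello world " * 20)   # low entropy, fully printable
    blob_file = Path(tmp) / "blob.bin"
    blob_file.write_bytes(bytes(range(256)))       # maximum entropy, mostly non-printable

    model = load_model(model_path)
    text_verdict, _ = scan_file(model, text_file)
    blob_verdict, _ = scan_file(model, blob_file)
```

The feature extraction on the customer machine must match the one used during training exactly, otherwise the shipped weights no longer mean anything.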
Demo
• Using the open-source malware classifier “Ember” to train a model and classify files
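A rough sketch of what such a demo could look like, based on the helper functions documented in the Ember repository README. It assumes the `ember` package and its dependencies are installed and that the Ember dataset has already been downloaded and unpacked into `data_dir`. The wrapper function names are hypothetical, and the exact steps of the talk's demo are not given in the slides.

```python
def train_ember_model(data_dir: str):
    """Vectorize the raw Ember features in data_dir and train a model on them."""
    import ember  # third-party package from the Ember repository

    ember.create_vectorized_features(data_dir)  # writes vectorized feature files into data_dir
    return ember.train_model(data_dir)          # returns a trained LightGBM model


def score_file(model, path: str) -> float:
    """Return Ember's maliciousness score for a single PE file."""
    import ember

    with open(path, "rb") as fh:
        return ember.predict_sample(model, fh.read())
```

A score close to 1 suggests the file looks malicious to the model and a score close to 0 suggests benign. The cut-off to act on is a choice the user has to make.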
References
• https://www.av-test.org/en/statistics/malware/
• Ember source: https://github.com/endgameinc/ember
• Malware Data Science by Joshua Saxe and Hillary Sanders: https://www.amazon.com/Malware-Data-Science-Detection-Attribution/dp/1593278594
