4. Threat Monitoring & Research Team
o 24x7 monitoring for malware events
o Assist customers with their Forensics and Incident Response
We enhance malware detection accuracy
o False positives/negatives
o Deep-dive research
We work with the security ecosystem
o Contribute to and learn from the malware KB
o Best of 3rd-party threat data
cyphort.com/blog
5. Agenda
o Hype vs. Reality
o Security Applications
o The Machine Learning Toolkit
o Advantages and Pitfalls
o Takeaways and Q&A
7. Hype vs. Reality: Reality
Data, numerical software, high performance computing
Prediction, classification, pattern discovery
8. Security Applications
o Given all the information I know about a file or event:
o Is a file or event malicious? (Yes, No)
o If malicious, what type of malware is it? (Trojan, Worm, Adware, etc.)
o How can I quantify the risk of the attack? (High, Medium, Low)
o How can I determine if the attack is part of a larger campaign against my
infrastructure? How likely am I to get hit again? (Next hour, week, month)
9. Security Applications - Traditional Approaches
o Static
o Packer, file type, file size, code obfuscation
o Defense by checksum match, static property signatures
o Scalable, but lacking coverage
o Behavioral
o Logged behavior from sandboxing (file creation, C&C activity, etc.)
o Manually create “behavioral signatures”
o Better coverage, but not always scalable (more possibilities)
o Reputation
o “Crowdsourcing” the detection problem
o Can’t detect targeted threats.
Coverage vs Scalability!
10. Security Applications - Machine Learning
Static Data + Behavior Data + Reputation Data
→ Machine Learning
→ Detection, Classification, Risk Assessment
+ Coverage and Scalability
11. The Machine Learning Toolkit
Discovering statistical relationships with data:
BIG DATA, SMALL DATA
...BUT MAINLY RELEVANT DATA
12. The Machine Learning Toolkit - Data is King
All machine learning models need to be “trained” on data.
File/event samples
(Training Data)
Feature extraction
● Static
● Behavioral
● Reputation
● etc.
Analyze, clean and normalize data
Train Model
13. The Machine Learning Toolkit - Data is King
The training data is the most important factor in the success of the model.
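The "analyze, clean and normalize" step matters because raw features live on wildly different scales. A minimal sketch using min-max scaling, with hypothetical per-file features (size, entropy, import count) chosen purely for illustration:

```python
import numpy as np

# Hypothetical raw features per file: [size_kb, entropy, num_imports]
raw = np.array([
    [512.0,  7.8,  3.0],
    [ 24.0,  4.1, 40.0],
    [2048.0, 7.9,  1.0],
])

def normalize(X):
    # Min-max scale each column to [0, 1] so large-valued features
    # (like file size) don't dominate distance-based models.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

features = normalize(raw)
```

Standardization (zero mean, unit variance) is an equally common choice; the point is that the cleaning step, not the model, usually decides whether training succeeds.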
14. The Machine Learning Toolkit - Data is King
Types of machine learning:
o Supervised Learning
o Unsupervised Learning
o Semi-supervised Learning (Combination of supervised + unsupervised)
15. The Machine Learning Toolkit - Supervised Learning
Supervised Learning:
o The "outcome" of each training sample is already known
o EXAMPLE: “Binary” classification
Object 1 Features
Object 2 Features
Object 3 Features
…
Object 2000 Features
Object 1 = Malware
Object 2 = Clean
Object 3 = Clean
Object 4 = Malware
Object 5 = Malware
Object 6 = Malware
…
Object 2000 = Clean
Training Data
Train Model
Test Model on Unknown Samples
New Object Features
Malware?
Clean?
o Techniques
o Linear/Logistic Regression
o Support Vector Machines
o Classification Trees, Random Forests
o Neural Networks (“Deep Learning”)
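The binary malware/clean setup above can be sketched with a tiny logistic regression trained by gradient descent; the feature values are synthetic stand-ins (e.g. a normalized entropy and packer score), and a real system would use one of the listed techniques from an ML library.

```python
import numpy as np

# Toy training data: rows are hypothetical normalized features per object
# (e.g. [entropy, packer_score]); labels: 1 = malware, 0 = clean.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.85, 0.7],
              [0.2, 0.3], [0.1, 0.2], [0.3, 0.1]])
y = np.array([1, 1, 1, 0, 0, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, steps=5000):
    # Plain gradient descent on the logistic loss.
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)

w = train(X, y)
# predict(w, new_object_features) -> 1 (malware) or 0 (clean)
```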
16. The Machine Learning Toolkit - Unsupervised Learning
Unsupervised Learning:
o The "outcome" of each training sample is unknown
o EXAMPLE: Finding families of malware
Malware 1 Features
Malware 2 Features
Malware 3 Features
…
Malware 1001 Features
Training Data
Train Model
Discover similar “groupings” of samples
Group 1
Malware 17
Malware 1
Malware 264
...
Group 2
Malware 107
Malware 6
Malware 2
...
Group 3
Malware 936
Malware 851
Malware 1001
...
o Techniques
o Clustering algorithms
o Self-organizing maps
o Principal components analysis
o Archetypal analysis
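The grouping step above can be sketched with a bare-bones k-means; the feature vectors are synthetic stand-ins for extracted malware features, and production work would use a clustering library rather than this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D feature vectors for two hypothetical malware families:
# one family clusters near (0, 0), the other near (5, 5).
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
               rng.normal(5.0, 0.5, (10, 2))])

def kmeans(X, k, iters=20):
    # Farthest-point initialization (k-means++ is the usual refinement).
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each sample to its nearest center...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ...then move each center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)  # samples sharing a label form one "family"
```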
17. The Machine Learning Toolkit - Clustering
Clustering is a popular ML tool in malware analysis.
(Feature = “Dimension”)
18. The Machine Learning Toolkit - Clustering
Clustering is a popular ML tool in malware analysis.
But things break down in higher dimensions!
(Feature = “Dimension”)
19-23. The Curse of Dimensionality (Bellman, 1961)
Objects that are close together in 2-D space may be much farther apart in higher dimensions (some math ahead!)
[Figure: samples plotted in a 1 x 1 2-D feature space, axes Feature 1 and Feature 2]
o Want to cluster samples in a 0.1 x 0.1 square (1% of all the possible data)
o In 2-D: must cover 10% of the range of each feature
o In 3-D: to cover 1% of the total data volume, must cover ~21.5% of the range of each feature: (0.215)^3 ≈ 0.01
o In 100-D: to cover 1% of the total data volume, must cover ~95.5% of the range of each feature: (0.955)^100 ≈ 0.01
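The per-feature coverage figures above come from solving s^d = 0.01 for the side length s, i.e. s = 0.01^(1/d). A quick check:

```python
# Side length s of a d-dimensional hypercube that holds 1% of the
# volume of the unit hypercube: s**d = 0.01, so s = 0.01 ** (1/d).
def side_for_one_percent(d):
    return 0.01 ** (1.0 / d)

for d in (2, 3, 100):
    # prints 10.0%, 21.5%, and 95.5% respectively
    print(f"{d}-D: cover {side_for_one_percent(d):.1%} of each feature's range")
```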
24. The Curse of Dimensionality (contd.)
o Ignore the curse ⇒ bad models, more false positives/false negatives
o Solutions: Dimensionality reduction, redefine “closeness”, or skip clustering in
favor of other algorithms:
Support Vector Machines
Penalized Regression
Random Forests
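Of the workarounds above, dimensionality reduction is the most common first step. A minimal PCA sketch via SVD, using synthetic data that stands in for redundant high-dimensional malware features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 100-D feature vectors whose real variation lives in only
# 2 directions -- high-dimensional security features are often redundant.
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 100))
X = latent @ basis + 0.01 * rng.normal(size=(200, 100))

def pca_reduce(X, k):
    # Project centered data onto its top-k principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

Z = pca_reduce(X, k=2)   # now cluster in 2-D instead of 100-D
```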
25. Supervised Learning - Regression, SVM, et al.
o Not sensitive to the “curse”
o Statistical function approximation
o Powerful, scalable with large datasets + many features
26. Supervised Learning - Regression, SVM, et al.
BUT…
o The dataset and features still carry the most weight.
o Using it as a “black box” could result in catastrophic failure!
Model creation should “extend a hypothesis”
27. Supervised Learning - Data is STILL king!
o “Models should extend a hypothesis”
o Features should have a reason to be used.
o Avoid “Spurious correlations”
Correlation
≠
Causation
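The spurious-correlation risk is easy to demonstrate: with enough candidate features, some will correlate strongly with the labels by pure chance. All numbers below are synthetic noise with no real relationship to the labels.

```python
import numpy as np

rng = np.random.default_rng(42)

# 50 labeled samples and 2000 features of pure noise: none of these
# features has any true relationship to the label.
y = rng.integers(0, 2, size=50).astype(float)
X = rng.normal(size=(50, 2000))

# Pearson correlation of every feature column with the label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

best = np.abs(corr).max()  # some noise feature will look "predictive"
```

A model trained to trust that best-correlated feature would fail on new data, which is why features should have a reason, ideally a causal one, to be used.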
28. But there’s more...
The dangers of modeling data before you fully understand it:
Relationships may change, or even reverse, across different groups (Simpson's paradox)!
31. Separating the Signal from the Noise
A “Needle in the haystack” situation:
o Low prevalence of threat (1 out of every 100,000 objects)
o Built a supervised classifier which can detect 95% of threats, with a 1% FP rate
o A false positive is ~1000x more likely than a true detection (high false discovery rate)
o ML may be able to detect the signal, but not without too much noise.
For any machine learning algorithm,
there is always a tradeoff between high detection and low false positives.
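The arithmetic behind the needle-in-a-haystack numbers above, spelled out:

```python
prevalence = 1 / 100_000   # 1 malicious object per 100,000 scanned
tpr = 0.95                 # detection rate (true positive rate)
fpr = 0.01                 # false positive rate

tp = tpr * prevalence            # expected true alerts per object scanned
fp = fpr * (1 - prevalence)      # expected false alerts per object scanned

fp_per_tp = fp / tp              # ~1053 false alerts for every true one
precision = tp / (tp + fp)       # P(actually malicious | alert) ~ 0.00095
```

Even an excellent classifier drowns in false positives when the base rate is this low; this is why the raw detection rate alone says little about operational usefulness.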
32. Separating the Signal from the Noise
How to increase detection without increasing false positives:
Collect better, more predictive training data!
(Predicted initial reaction from engineering manager, with 95% confidence)
33. Takeaways
o When done correctly, ML offers both coverage and scalability in threat detection
o It is not without its own shortcomings:
o The “needle in the haystack” problem
o Spurious correlations
o It is still one of the most scalable ways to detect targeted zero-day attacks, when
coupled with behavioral analysis
o The “Gold Standard” for successfully using machine learning
o Know your data - let it guide your use of ML, not the other way around.
o Know the benefits and pitfalls of your algorithms
o Be ready to iterate, rinse, and repeat