4. Threat Monitoring & Research Team
o 24x7 monitoring for malware events
o Assist customers with their Forensics and Incident Response
We enhance malware detection accuracy
o False positives/negatives
o Deep-dive research
We work with the security ecosystem
o Contribute to and learn from the malware KB
o Best of 3rd-party threat data
cyphort.com/blog
5. Agenda
o Hype vs. Reality
o Security Applications
o The Machine Learning Toolkit
o Advantages and Pitfalls
o Takeaways and Q&A
7. Hype vs. Reality: Reality
Data, numerical software, high performance computing
Prediction, classification, pattern discovery
8. Security Applications
o Given all the information I know about a file or event:
o Is a file or event malicious? (Yes, No)
o If malicious, what type of malware is it? (Trojan, Worm, Adware, etc.)
o How can I quantify the risk of the attack? (High, Medium, Low)
o How can I determine if the attack is part of a larger campaign against my
infrastructure? How likely am I to get hit again? (Next hour, week, month)
9. Security Applications - Traditional Approaches
o Static
o Packer, file type, file size, code obfuscation
o Defense by checksum match, static property signatures
o Scalable, but lacking coverage
o Behavioral
o Logged behavior from sandboxing (file creation, C&C activity, etc.)
o Manually create “behavioral signatures”
o Better coverage, but not always scalable (more possibilities)
o Reputation
o “Crowdsourcing” the detection problem
o Can’t detect targeted threats.
Coverage vs Scalability!
10. Security Applications - Machine Learning
Static Data + Behavior Data + Reputation Data
→ Machine Learning
→ Detection, Classification, Risk Assessment
+ Coverage and Scalability
11. The Machine Learning Toolkit
Discovering statistical relationships with data:
BIG DATA, SMALL DATA
...BUT MAINLY RELEVANT DATA
12. The Machine Learning Toolkit - Data is King
All machine learning models need to be “trained” on data.
File/event samples
(Training Data)
Feature extraction
● Static
● Behavioral
● Reputation
● etc.
Analyze, clean and normalize data
Train Model
13. The Machine Learning Toolkit - Data is King
The training data is the most important factor in the success of the model.
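The "analyze, clean and normalize" step matters because raw features live on wildly different scales. A minimal sketch using min-max scaling, with hypothetical per-file features (size, entropy, import count) chosen purely for illustration:

```python
import numpy as np

# Hypothetical raw features per file: [size_kb, entropy, num_imports]
raw = np.array([
    [512.0,  7.8,  3.0],
    [ 24.0,  4.1, 40.0],
    [2048.0, 7.9,  1.0],
])

def normalize(X):
    # Min-max scale each column to [0, 1] so large-valued features
    # (like file size) don't dominate distance-based models.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

features = normalize(raw)
```

Standardization (zero mean, unit variance) is an equally common choice; the point is that the cleaning step, not the model, usually decides whether training succeeds.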
14. The Machine Learning Toolkit - Data is King
Types of machine learning:
o Supervised Learning
o Unsupervised Learning
o Semi-supervised Learning (Combination of supervised + unsupervised)
15. The Machine Learning Toolkit - Supervised Learning
Supervised Learning:
o The "outcome" of each training sample is already known
o EXAMPLE: “Binary” classification
Object 1 Features
Object 2 Features
Object 3 Features
…
Object 2000 Features
Object 1 = Malware
Object 2 = Clean
Object 3 = Clean
Object 4 = Malware
Object 5 = Malware
Object 6 = Malware
…
Object 2000 = Clean
Training Data
Train Model
Test Model on Unknown Samples
New Object Features
Malware?
Clean?
o Techniques
o Linear/Logistic Regression
o Support Vector Machines
o Classification Trees, Random Forests
o Neural Networks (“Deep Learning”)
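The binary malware/clean setup above can be sketched with a tiny logistic regression trained by gradient descent; the feature values are synthetic stand-ins (e.g. a normalized entropy and packer score), and a real system would use one of the listed techniques from an ML library.

```python
import numpy as np

# Toy training data: rows are hypothetical normalized features per object
# (e.g. [entropy, packer_score]); labels: 1 = malware, 0 = clean.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.85, 0.7],
              [0.2, 0.3], [0.1, 0.2], [0.3, 0.1]])
y = np.array([1, 1, 1, 0, 0, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, steps=5000):
    # Plain gradient descent on the logistic loss.
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)

w = train(X, y)
# predict(w, new_object_features) -> 1 (malware) or 0 (clean)
```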
16. The Machine Learning Toolkit - Unsupervised Learning
Unsupervised Learning:
o The "outcome" of each training sample is unknown
o EXAMPLE: Finding families of malware
Malware 1 Features
Malware 2 Features
Malware 3 Features
…
Malware 1001 Features
Training Data
Train Model
Discover similar “groupings” of samples
Group 1
Malware 17
Malware 1
Malware 264
...
Group 2
Malware 107
Malware 6
Malware 2
...
Group 3
Malware 936
Malware 851
Malware 1001
...
o Techniques
o Clustering algorithms
o Self-organizing maps
o Principal components analysis
o Archetypal analysis
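The grouping step above can be sketched with a bare-bones k-means; the feature vectors are synthetic stand-ins for extracted malware features, and production work would use a clustering library rather than this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D feature vectors for two hypothetical malware families:
# one family clusters near (0, 0), the other near (5, 5).
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
               rng.normal(5.0, 0.5, (10, 2))])

def kmeans(X, k, iters=20):
    # Farthest-point initialization (k-means++ is the usual refinement).
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each sample to its nearest center...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ...then move each center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)  # samples sharing a label form one "family"
```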
17. The Machine Learning Toolkit - Clustering
Clustering is a popular ML tool in malware analysis.
(Feature = “Dimension”)
18. The Machine Learning Toolkit - Clustering
Clustering is a popular ML tool in malware analysis.
But things break down in higher dimensions!
(Feature = “Dimension”)
19-23. The Curse of Dimensionality (Bellman, 1961)
Objects that are close together in 2-D space may be much farther apart in higher dimensions (some math ahead!)
[Figure: samples plotted in a 1 x 1 2-D feature space, axes Feature 1 and Feature 2]
o Want to cluster samples in a 0.1 x 0.1 square (1% of all the possible data)
o In 2-D: must cover 10% of the range of each feature
o In 3-D: to cover 1% of the total data volume, must cover ~21.5% of the range of each feature: (0.215)^3 ≈ 0.01
o In 100-D: to cover 1% of the total data volume, must cover ~95.5% of the range of each feature: (0.955)^100 ≈ 0.01
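The per-feature coverage figures above come from solving s^d = 0.01 for the side length s, i.e. s = 0.01^(1/d). A quick check:

```python
# Side length s of a d-dimensional hypercube that holds 1% of the
# volume of the unit hypercube: s**d = 0.01, so s = 0.01 ** (1/d).
def side_for_one_percent(d):
    return 0.01 ** (1.0 / d)

for d in (2, 3, 100):
    # prints 10.0%, 21.5%, and 95.5% respectively
    print(f"{d}-D: cover {side_for_one_percent(d):.1%} of each feature's range")
```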
24. The Curse of Dimensionality (contd.)
o Ignore the curse ⇒ bad models, more false positives/false negatives
o Solutions: Dimensionality reduction, redefine “closeness”, or skip clustering in
favor of other algorithms:
Support Vector Machines
Penalized Regression
Random Forests
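Of the workarounds above, dimensionality reduction is the most common first step. A minimal PCA sketch via SVD, using synthetic data that stands in for redundant high-dimensional malware features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 100-D feature vectors whose real variation lives in only
# 2 directions -- high-dimensional security features are often redundant.
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 100))
X = latent @ basis + 0.01 * rng.normal(size=(200, 100))

def pca_reduce(X, k):
    # Project centered data onto its top-k principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

Z = pca_reduce(X, k=2)   # now cluster in 2-D instead of 100-D
```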
25. Supervised Learning - Regression, SVM, et al.
o Not sensitive to the “curse”
o Statistical function approximation
o Powerful, scalable with large datasets + many features
26. Supervised Learning - Regression, SVM, et al.
BUT…
o The dataset and features still carry the most weight.
o Using it as a “black box” could result in catastrophic failure!
Model creation should “extend a hypothesis”
27. Supervised Learning - Data is STILL king!
o “Models should extend a hypothesis”
o Features should have a reason to be used.
o Avoid “Spurious correlations”
Correlation
≠
Causation
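The spurious-correlation risk is easy to demonstrate: with enough candidate features, some will correlate strongly with the labels by pure chance. All numbers below are synthetic noise with no real relationship to the labels.

```python
import numpy as np

rng = np.random.default_rng(42)

# 50 labeled samples and 2000 features of pure noise: none of these
# features has any true relationship to the label.
y = rng.integers(0, 2, size=50).astype(float)
X = rng.normal(size=(50, 2000))

# Pearson correlation of every feature column with the label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

best = np.abs(corr).max()  # some noise feature will look "predictive"
```

A model trained to trust that best-correlated feature would fail on new data, which is why features should have a reason, ideally a causal one, to be used.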
28. But there’s more...
The dangers of modeling data before you fully understand it:
Relationships may change, or even reverse, across different groups (Simpson's paradox)!
31. Separating the Signal from the Noise
A “Needle in the haystack” situation:
o Low prevalence of threat (1 out of every 100,000 objects)
o Built a supervised classifier which can detect 95% of threats, with a 1% FP rate
o A false positive is ~1000x more likely than a true detection (high false discovery rate)
o ML may be able to detect the signal, but not without too much noise.
For any machine learning algorithm,
there is always a tradeoff between high detection and low false positives.
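The arithmetic behind the needle-in-a-haystack numbers above, spelled out:

```python
prevalence = 1 / 100_000   # 1 malicious object per 100,000 scanned
tpr = 0.95                 # detection rate (true positive rate)
fpr = 0.01                 # false positive rate

tp = tpr * prevalence            # expected true alerts per object scanned
fp = fpr * (1 - prevalence)      # expected false alerts per object scanned

fp_per_tp = fp / tp              # ~1053 false alerts for every true one
precision = tp / (tp + fp)       # P(actually malicious | alert) ~ 0.00095
```

Even an excellent classifier drowns in false positives when the base rate is this low; this is why the raw detection rate alone says little about operational usefulness.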
32. Separating the Signal from the Noise
How to increase detection without increasing false positives:
Collect better, more predictive training data!
(Predicted initial reaction from engineering manager, with 95% confidence)
33. Takeaways
o When done correctly, ML offers both coverage and scalability in threat detection
o It is not without its own shortcomings:
o The “needle in the haystack” problem
o Spurious correlations
o It is still one of the most scalable ways to detect targeted zero-day attacks, when
coupled with behavioral analysis
o The “Gold Standard” for successfully using machine learning
o Know your data - let it guide your use of ML, not the other way around.
o Know the benefits and pitfalls of your algorithms
o Be ready to iterate, rinse, and repeat