
Machine Learning Model Bakeoff

References and links available here: https://gist.github.com/mrphilroth/854e51e12a1d27d847aac3f153800fa6



  1. MODEL BAKEOFF: Which model came hot and fresh out the kitchen in our malware classifier bakeoff? Data Intelligence Conference 2017. Phil Roth
  2. whoami: Phil Roth, Data Scientist. @mrphilroth proth@endgame.com
  3. whoami: Hyrum Anderson @drhyrum, Jonathan Woodbridge @jswoodbridge, Bobby Filar @filar, Phil Roth (Data Scientist) @mrphilroth proth@endgame.com
  4. Data Science in Security: Domain Generation Algorithm (DGA) protection, malicious file classification, anomaly detection, insider threat detection, network intrusion detection systems, etc. These range from doable to hardly doable.
  5. Mission: "The best minds of my generation are thinking about how to make people click ads. That sucks." - Jeff Hammerbacher. https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different
  6. Relevant: https://arstechnica.com/ [various]
  7. Challenges: Lack of open datasets. Labels are expensive to obtain. High cost of false positives AND false negatives.
  8. Malware Classification: Antivirus, as a supervised ML problem. Windows executables (portable executables, or PEs) are sorted into two classes: benign and malicious.
  9. MalwareScore: Static features. Deployed to customer machines. Available at VirusTotal: https://www.virustotal.com/
  10. Why a Model Bakeoff? Decide on an approach.
  11. Why a Model Bakeoff? Build institutional knowledge of diverse models.
  12. Why NOT a Model Bakeoff? Incomplete exploration of the model design space. (Mike Bostock. Visualizing Algorithms. https://bost.ocks.org/mike/algorithms/)
  13. Why NOT a Model Bakeoff? Setting up the bakeoff is most of the work. [Chart: "Highly Scientific and Definitely Not Made Up Data About How Hard Tasks Are"]
  14. Why NOT a Bakeoff? Optimizing the model is squeezing out the last bits of performance. ("Blue is so much better!!!!")
  15. Setup
  16. Data: MalwareScore is trained on 6M benign and 3M malicious files. The bakeoff was carried out on an early subset of those. [Chart: top 10 malicious families]
  17. Features: Raw-data-based features (byte histogram, sliding window byte entropy) and PE-header-based features (PE information, imports, exports, sections, version information).
  18. Features: Byte histogram and sliding window byte entropy. [Figure: raw bytes of a PE file, beginning with the "MZ" magic number and the "This program cannot be run in DOS mode" stub]
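The two raw-data feature groups above can be sketched in plain Python. This is a minimal illustration, not the production MalwareScore extractor; the 1024-byte window size is an assumption.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Count occurrences of each possible byte value (0-255)."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def sliding_window_entropy(data: bytes, window: int = 1024) -> list:
    """Shannon entropy (bits per byte) of each fixed-size window of the file."""
    entropies = []
    for i in range(0, len(data), window):
        chunk = data[i:i + window]
        counts = Counter(chunk)
        ent = -sum((c / len(chunk)) * math.log2(c / len(chunk))
                   for c in counts.values())
        entropies.append(ent)
    return entropies
```

Packed or encrypted sections show up as high-entropy windows (approaching 8 bits per byte), which is one reason entropy is a useful static feature for malware classification.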
  19. Features: PE information, PE imports, PE exports, PE sections, PE version information.
  20. Features: Feature hashing is applied to each of these feature groups (PE information, imports, exports, sections, version information).
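Feature hashing (the "hashing trick") maps variable-length sets of strings, such as imported function names, into fixed-length vectors. A minimal sketch in plain Python; the bucket count and token format are illustrative assumptions, not the actual MalwareScore settings:

```python
import hashlib

def hash_features(tokens, n_buckets=256):
    """Map an arbitrary list of string tokens to a fixed-size count vector."""
    vec = [0] * n_buckets
    for tok in tokens:
        # Stable hash -> bucket index; collisions are tolerated by design.
        idx = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

# e.g. hashing a PE's import table into a fixed-width feature vector
imports = ['KERNEL32.dll:CreateFileW', 'KERNEL32.dll:WriteFile',
           'ADVAPI32.dll:RegOpenKeyExW']
features = hash_features(imports)
```

scikit-learn's `FeatureHasher` implements the same idea, with a signed hash to reduce collision bias.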
  21. Models: Model architectures were chosen by the team member most knowledgeable about each. A small grid search was carried out to find the best parameters.
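The code slides that follow all call a shared `bake` helper, whose implementation isn't shown in the deck. A plausible reconstruction using scikit-learn's `GridSearchCV` might look like this; the scoring metric, fold count, and function body are assumptions, not the author's actual code:

```python
from sklearn.model_selection import GridSearchCV

def bake(estimator, tuned_parameters, X, y):
    """Grid-search the given parameter grid and return the best fitted model."""
    search = GridSearchCV(estimator, tuned_parameters,
                          scoring='roc_auc',  # the bakeoff's performance metric
                          cv=3, n_jobs=-1)
    search.fit(X, y)
    return search.best_estimator_
```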
  22. Metrics to Judge Models: Performance, model size, query execution time.
  23. Metrics: Performance is measured as ROC AUC (area under the receiver operating characteristic curve). http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
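For reference, ROC AUC is computed from classifier scores and true labels; scikit-learn makes it a one-liner (toy numbers below, not bakeoff data):

```python
from sklearn.metrics import roc_auc_score

# Classifier scores for four samples and their true labels (toy data)
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_scores)  # 0.75: one of four pos/neg pairs mis-ranked
```

An AUC of 0.5 corresponds to random ranking; 1.0 means the classifier separates the classes perfectly at some threshold.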
  24. Metrics: Model size gauges the feasibility of deploying a model to a user's computer. Query time gauges the feasibility of evaluating PE files as they are written to disk or as users attempt to run them.
  25. Contenders
  26. Resources: http://www-bcf.usc.edu/~gareth/ISL/ and http://scikit-learn.org/stable/documentation.html
  27. Nearest Neighbor (Introduction to Statistical Learning, page 40):

      import pickle
      from sklearn.neighbors import KNeighborsClassifier

      tuned_parameters = {'n_neighbors': [1, 5, 11],
                          'weights': ['uniform', 'distance']}
      model = bake(KNeighborsClassifier(algorithm='ball_tree'),
                   tuned_parameters, X[ix], y[ix])
      pickle.dump(model, open('Bakeoff_kNN.pkl', 'wb'))  # binary mode for pickle
  28. Logistic Regression (Introduction to Statistical Learning, page 131):

      import pickle
      from sklearn.linear_model import SGDClassifier

      tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3],
                          'l1_ratio': [0.0, 0.15, 0.85, 1.0]}
      model = bake(SGDClassifier(loss='log', penalty='elasticnet'),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_logisticRegression.pkl', 'wb'))
  29. Support Vector Machine (Introduction to Statistical Learning, page 342). A linear SVM, trained by SGD with hinge loss:

      import pickle
      from sklearn.linear_model import SGDClassifier

      tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3],
                          'l1_ratio': [0.0, 0.15, 0.85, 1.0]}
      model = bake(SGDClassifier(loss='hinge', penalty='elasticnet'),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_SVM.pkl', 'wb'))
  30. Naïve Bayes (http://scikit-learn.org/stable/modules/naive_bayes.html):

      import pickle
      from sklearn.naive_bayes import GaussianNB

      tuned_parameters = {}  # GaussianNB has no hyperparameters to tune
      model = bake(GaussianNB(), tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_NB.pkl', 'wb'))
  31. Random Forest (Introduction to Statistical Learning, page 308):

      import pickle
      from sklearn.ensemble import RandomForestClassifier

      tuned_parameters = {'n_estimators': [20, 50, 100],
                          'min_samples_split': [2, 5, 10],
                          'max_features': ['sqrt', 0.1, 0.2]}
      model = bake(RandomForestClassifier(oob_score=False),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_randomforest.pkl', 'wb'))
  32. Gradient Boosted Decision Trees (http://scikit-learn.org/stable/auto_examples/index.html):

      import pickle
      from xgboost import XGBClassifier

      tuned_parameters = {'max_depth': [3, 4, 5],
                          'n_estimators': [20, 50, 100],
                          'colsample_bytree': [0.9, 1.0]}
      model = bake(XGBClassifier(), tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_xgboost.pkl', 'wb'))
  33. Deep Learning: Features are fed to a multilayer perceptron with three hidden layers using dropout and batch normalization. http://deeplearning.net/tutorial/mlp.html
  34. The Keras model (input_dropout, hidden_dropout, and n_units are hyperparameters defined elsewhere):

      from keras.models import Sequential
      from keras.layers.core import Dense, Dropout
      from keras.layers import PReLU, BatchNormalization

      model = Sequential()
      model.add(Dropout(input_dropout, input_shape=(n_units,)))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(1, activation='sigmoid'))
      model.compile(loss='binary_crossentropy', optimizer='adam',
                    metrics=['accuracy'])
  35. Results
  36. Results [chart]
  37. Performance: ROC AUC [chart]
  38. Model Size [chart]
  39. Model Size [chart]
  40. Query Rate [chart]
  41. Takeaways
  42. Takeaways: In effect, we set up a Kaggle-style competition, and xgboost tends to win these competitions. A proposed answer for why: Newton boosting instead of MART (multiple additive regression trees). Available at: http://hdl.handle.net/11250/2433761
  43. But deep learning… We haven't yet traveled far enough down the deep learning path. The equivalent of this does not yet exist for PE files: [Figure: GoogLeNet architecture from the Inception paper. Available at: https://arxiv.org/abs/1409.4842]
  44. But deep learning… The most useful features are transparently available. Decision trees set a very high bar for deep learning to clear.
  45. But deep learning… At Endgame, we do believe there is discriminating power in the section data itself. We are researching the best deep learning architectures and the best way to combine the results with PE header features.
  46. Our Conclusions: Let's run GBDT on the endpoint! Let's provide a larger model in the cloud! Deep learning deserves more research!
  47. Further Reading:
      Examining Malware with Python: https://www.endgame.com/blog/technical-blog/examining-malware-python
      Machine Learning: https://www.endgame.com/blog/technical-blog/machine-learning-you-gotta-tame-beast-you-let-it-out-its-cage
      It's a Bake-off! https://www.endgame.com/blog/technical-blog/its-bake-navigating-evolving-world-machine-learning-models
  48. THANK YOU. proth@endgame.com @mrphilroth
