MODEL BAKEOFF
WHICH MODEL CAME HOT AND FRESH OUT THE
KITCHEN IN OUR MALWARE CLASSIFIER
BAKEOFF?
Data Intelligence Conference 2017
Phil Roth
whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Hyrum Anderson
@drhyrum
Jonathan Woodbridge
@jswoodbridge
Bobby Filar
@filar
Data Science in Security
Domain Generation Algorithm (DGA) Protection
Malicious File Classification
Anomaly Detection
Insider Threat Detection
Network Intrusion Detection System
doable
hardly doable
etc.
Mission
The best minds of my generation
are thinking about how to make
people click ads. That sucks.
- Jeff Hammerbacher
https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different
Relevant
https://arstechnica.com/ [various]
Challenges
Lack of open datasets
Labels are expensive to obtain
High cost of false positives AND false negatives
Malware Classification
Antivirus. As a supervised ML
problem.
Windows Executables
(portable executables or PEs)
are sorted into two classes:
benign and malicious
MalwareScore
Static features
Deployed to customer
machines
Available at VirusTotal
https://www.virustotal.com/
Why a Model Bakeoff?
Decide on an approach
Why a Model Bakeoff?
Build institutional knowledge of diverse models
Why NOT a Model Bakeoff?
Incomplete exploration of model design space
Mike Bostock. Visualizing Algorithms. https://bost.ocks.org/mike/algorithms/
Why NOT a Model Bakeoff?
Setting up the bakeoff is most of the work
Highly Scientific and Definitely Not Made Up Data About How Hard Tasks Are
Why NOT a Bakeoff?
Optimizing the model
is squeezing the last
bits of performance.
Blue is so much better!!!!
Setup
Data
Top 10 Malicious Families
MalwareScore is trained on 6M benign and 3M malicious files
The bakeoff was carried out on an early subset of those
Features
Raw data based features:
Byte Histogram
Sliding Window Byte Entropy
PE header based features:
PE Information
PE Imports
PE Exports
PE Sections
PE Version Information
Features
Byte Histogram
Sliding Window Byte Entropy
0 3 0 0 0 4 0 0 0 255 255 0 0 184 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 216 0 0 77 90 144 14 31 186 14 0 180 9
205 33 184 1 76 205 33 84 104 105 115 32 112 114 111 103 114 97 109 32 99
97 110 110 111 116 32 98 101 32 117 110 32 105 110 32 68 79 83 32 109 111
100 101 46 13 13 10 36 0 0 0 0 0 0 0 49 184 132 58 117 217 234 105 117 217
234 105 11 217 234 105 182 214 181 105 119 217 234 105 117 217 235 105
238 217 234 105 182 214 183 105 100 217 234 105 33 250 218 105 127 217
234 105 178 223 236 105 116 217 234 105 82 105 99 104 117 21 234 105 0 2
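The two raw-data features named above can be sketched in a few lines of standard-library Python. This is an illustration only, not the MalwareScore feature extractor: the window and step sizes are assumptions, and the sample bytes are the opening of a typical DOS header used as stand-in input.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Normalized count of each of the 256 possible byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = max(len(data), 1)  # avoid dividing by zero on empty input
    return [c / total for c in counts]


def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window in bits (between 0.0 and 8.0)."""
    n = len(window)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def sliding_entropy(data: bytes, window: int = 2048, step: int = 1024) -> list[float]:
    """Entropy of each window; window and step sizes are illustrative guesses."""
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


# Stand-in input: the first 16 bytes of a typical DOS ("MZ") header.
sample = bytes([77, 90, 144, 0, 3, 0, 0, 0, 4, 0, 0, 0, 255, 255, 0, 0])
hist = byte_histogram(sample)
entropies = sliding_entropy(sample, window=8, step=4)
```

The histogram always has 256 entries regardless of file size, and the entropy values would still need to be summarized (for example binned) to give a fixed-length vector.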
Features
PE Information
PE Imports
PE Exports
PE Sections
PE Version Information
Feature Hashing is applied
to each of these feature
groups.
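As a rough illustration of feature hashing, the sketch below maps a variable-length list of strings (here, made-up import names) into a fixed-length vector. The bin count, the choice of SHA-256 and the token format are all assumptions; scikit-learn provides a FeatureHasher class for the same purpose, and nothing here is taken from the Endgame code.

```python
import hashlib


def hash_features(tokens, n_bins=256):
    """Hashing trick: fold string tokens into a fixed-length signed count vector."""
    vec = [0.0] * n_bins
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        index = h % n_bins                       # which bin the token lands in
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # signed updates reduce collision bias
        vec[index] += sign
    return vec


# Hypothetical "library:function" tokens for one file's import table.
imports = [
    "kernel32.dll:CreateFileA",
    "kernel32.dll:WriteFile",
    "user32.dll:MessageBoxA",
]
vec = hash_features(imports, n_bins=64)
```

The output length depends only on n_bins, so files with ten imports and files with a thousand produce vectors of the same shape.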
Models
Each model architecture was chosen by the team member
most knowledgeable about it
A small grid search was carried out to find the best
parameters for each model
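The code slides later in the deck call a helper named bake() that is never shown. One way such a helper could look, assuming it wraps scikit-learn's GridSearchCV and ranks candidates by ROC AUC (both assumptions, as are the cv value, the toy data and the LogisticRegression stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def bake(estimator, tuned_parameters, X, y, cv=3):
    """Hypothetical stand-in for the deck's bake(): small grid search scored by ROC AUC."""
    search = GridSearchCV(estimator, tuned_parameters, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    # best_estimator_ is refit on all of X, y with the winning parameters
    return search.best_estimator_


# Synthetic data in place of the real PE feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = bake(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, X, y)
```

Returning the refit best estimator keeps the call sites short: the result can be pickled or scored directly.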
Metrics to Judge Models
Performance
Model size
Query execution time
Metrics
Performance: ROC AUC
(area under receiver
operating characteristic
curve)
http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
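ROC AUC can be read as the probability that a randomly chosen malicious sample scores higher than a randomly chosen benign one. The small sketch below computes it that way on toy labels and scores (the data is made up; for real workloads the scikit-learn function linked above is the practical choice):

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]           # 1 = malicious, 0 = benign (toy data)
scores = [0.1, 0.4, 0.35, 0.8]  # model output per sample
auc = roc_auc(labels, scores)   # 0.75: three of the four pairs are ordered correctly
```

Because only the ordering of scores matters, the metric does not depend on where the decision threshold is eventually set.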
Metrics
Model Size: Gauge the feasibility of deploying a model
to a user’s computer
Query Time: Gauge the feasibility of evaluating PE
files as they are written to disk or as users attempt to
run them.
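Both of these metrics are easy to measure for any fitted scikit-learn style model. The sketch below is one possible way to do it, assuming pickled size is an acceptable proxy for on-disk footprint and that one-sample-at-a-time prediction approximates endpoint use; the random forest and synthetic data are placeholders.

```python
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def model_size_bytes(model):
    """Size of the pickled model, a rough proxy for its deployment footprint."""
    return len(pickle.dumps(model))


def mean_query_seconds(model, X, repeats=3):
    """Best-of-N average time to classify one sample at a time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for row in X:
            model.predict(row.reshape(1, -1))  # single-sample query
        best = min(best, time.perf_counter() - start)
    return best / len(X)


# Placeholder model and data; a real run would load the baked models.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
size = model_size_bytes(model)
latency = mean_query_seconds(model, X[:50])
```

Timing single-sample calls rather than one large batch matters here, since batch prediction can hide per-call overhead that dominates when files are scored one by one.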
Contenders
Resources
http://www-bcf.usc.edu/~gareth/ISL/
http://scikit-learn.org/stable/documentation.html
Nearest Neighbor
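import pickle
from sklearn.neighbors import KNeighborsClassifier
# bake() is the team's grid-search helper and ix an index array; neither is shown in the deck
tuned_parameters = {'n_neighbors': [1, 5, 11], 'weights': ['uniform', 'distance']}
model = bake(KNeighborsClassifier(algorithm='ball_tree'), tuned_parameters, X[ix], y[ix])
# pickle writes bytes, so the file is opened in binary mode
with open('Bakeoff_kNN.pkl', 'wb') as f:
    pickle.dump(model, f)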
Introduction to Statistical Learning page 40
Logistic Regression
import pickle
from sklearn.linear_model import SGDClassifier
tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3], 'l1_ratio': [0., 0.15, 0.85, 1.0]}
# loss='log' selects logistic regression; scikit-learn 1.1+ spells it 'log_loss'
model = bake(SGDClassifier(loss='log', penalty='elasticnet'), tuned_parameters, X, y)
with open('Bakeoff_logisticRegression.pkl', 'wb') as f:
    pickle.dump(model, f)
Introduction to Statistical Learning page 131
Support Vector Machine
Introduction to Statistical Learning page 342
import pickle
from sklearn.linear_model import SGDClassifier
tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3], 'l1_ratio': [0., 0.15, 0.85, 1.0]}
# loss='hinge' trains a linear SVM with stochastic gradient descent
model = bake(SGDClassifier(loss='hinge', penalty='elasticnet'), tuned_parameters, X, y)
with open('Bakeoff_SVM.pkl', 'wb') as f:
    pickle.dump(model, f)
Naïve Bayes
import pickle
from sklearn.naive_bayes import GaussianNB
tuned_parameters = {}  # nothing to tune, so the grid has a single candidate
model = bake(GaussianNB(), tuned_parameters, X, y)
with open('Bakeoff_NB.pkl', 'wb') as f:
    pickle.dump(model, f)
http://scikit-learn.org/stable/modules/naive_bayes.html
Random Forest
import pickle
from sklearn.ensemble import RandomForestClassifier
tuned_parameters = {'n_estimators': [20, 50, 100],
                    'min_samples_split': [2, 5, 10],
                    'max_features': ['sqrt', 0.1, 0.2]}
model = bake(RandomForestClassifier(oob_score=False), tuned_parameters, X, y)
# save the fitted model; pickle needs the file opened in binary mode
with open('Bakeoff_randomforest.pkl', 'wb') as f:
    pickle.dump(model, f)
Introduction to Statistical Learning page 308
Gradient Boosted Decision Trees
import pickle
from xgboost import XGBClassifier
tuned_parameters = {'max_depth': [3, 4, 5],
                    'n_estimators': [20, 50, 100],
                    'colsample_bytree': [0.9, 1.0]}
model = bake(XGBClassifier(), tuned_parameters, X, y)
# save the fitted model; pickle needs the file opened in binary mode
with open('Bakeoff_xgboost.pkl', 'wb') as f:
    pickle.dump(model, f)
http://scikit-learn.org/stable/auto_examples/index.html
Deep Learning
Features are fed to a multilayer perceptron with
three hidden layers using dropout and batch
normalization.
http://deeplearning.net/tutorial/mlp.html
from keras.models import Sequential
from keras.layers import Dense, Dropout, PReLU, BatchNormalization
# n_units, input_dropout and hidden_dropout are hyperparameters set elsewhere (not shown);
# n_units serves as both the input dimension and the hidden layer width
model = Sequential()
# input block: dropout and batch normalization applied to the feature vector
model.add(Dropout(input_dropout, input_shape=(n_units,)))
model.add(BatchNormalization())
# hidden layer 1
model.add(Dense(n_units))
model.add(PReLU())
model.add(Dropout(hidden_dropout))
model.add(BatchNormalization())
# hidden layer 2
model.add(Dense(n_units))
model.add(PReLU())
model.add(Dropout(hidden_dropout))
model.add(BatchNormalization())
# hidden layer 3
model.add(Dense(n_units))
model.add(PReLU())
model.add(Dropout(hidden_dropout))
model.add(BatchNormalization())
# output: a single sigmoid unit for the binary benign/malicious label
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Results
Performance: ROC AUC [figure]
Model Size [two figures]
Query Rate [figure]
Takeaways
We set up a Kaggle competition. xgboost tends to
win these competitions.
Available at: http://hdl.handle.net/11250/2433761
Proposed answer: Newton boosting instead of MART
(multiple additive regression trees)
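To make "Newton boosting" concrete: each boosting round uses both the first and second derivative of the loss when setting a leaf's value. The sketch below computes one such leaf weight for logistic loss, following the second-order objective described in the XGBoost documentation. The labels, starting scores and reg_lambda values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y, raw_score):
    """First and second derivative of logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)


def newton_leaf_weight(ys, raw_scores, reg_lambda=1.0):
    """Newton step for one leaf: -sum(gradients) / (sum(hessians) + lambda)."""
    g_sum = 0.0
    h_sum = 0.0
    for y, f in zip(ys, raw_scores):
        g, h = grad_hess(y, f)
        g_sum += g
        h_sum += h
    return -g_sum / (h_sum + reg_lambda)


# Four samples that fall into the same leaf, all starting from a raw score of 0.
ys = [1, 1, 0, 1]
raw = [0.0, 0.0, 0.0, 0.0]
w = newton_leaf_weight(ys, raw, reg_lambda=0.0)             # 1.0
w_regularized = newton_leaf_weight(ys, raw, reg_lambda=1.0)  # 0.5
```

The Hessian term scales the step by the local curvature of the loss, and reg_lambda shrinks the leaf weight toward zero.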
But deep learning…
We haven’t yet traveled far enough down the deep
learning path. This does not yet exist for PE files:
GoogLeNet Architecture from Inception paper. Available at: https://arxiv.org/abs/1409.4842
But deep learning…
The most useful features are transparently
available. Decision trees set a very high bar
for deep learning to clear.
But deep learning…
At Endgame, we do believe there
is discriminating power in the
section data itself.
We are researching the best deep
learning architectures and the
best way to combine the results
with PE header features.
Our Conclusions
Let’s run gbdt on the endpoint!
Let’s provide a larger model in the cloud!
Deep learning deserves more research!
Further Reading
Examining Malware with Python
https://www.endgame.com/blog/technical-blog/examining-malware-python
Machine Learning
https://www.endgame.com/blog/technical-blog/machine-learning-you-gotta-tame-beast-you-let-it-out-its-cage
It’s a Bake-off!
https://www.endgame.com/blog/technical-blog/its-bake-navigating-evolving-world-machine-learning-models
THANK YOU
proth@endgame.com @mrphilroth
