
Machine Learning Model Bakeoff

References and links available here: https://gist.github.com/mrphilroth/854e51e12a1d27d847aac3f153800fa6



  1. MODEL BAKEOFF: Which model came hot and fresh out the kitchen in our malware classifier bakeoff? Data Intelligence Conference 2017. Phil Roth
  2. whoami: Phil Roth, Data Scientist. @mrphilroth proth@endgame.com
  3. whoami: Hyrum Anderson @drhyrum, Jonathan Woodbridge @jswoodbridge, Bobby Filar @filar, Phil Roth (Data Scientist) @mrphilroth proth@endgame.com
  4. Data Science in Security: Domain Generation Algorithm (DGA) protection, malicious file classification, anomaly detection, insider threat detection, network intrusion detection systems, etc. These range from doable to hardly doable.
  5. Mission: "The best minds of my generation are thinking about how to make people click ads. That sucks." - Jeff Hammerbacher. https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different
  6. Relevant: https://arstechnica.com/ [various]
  7. Challenges: Lack of open datasets. Labels are expensive to obtain. High cost of false positives AND false negatives.
  8. Malware Classification: Antivirus, as a supervised ML problem. Windows executables (portable executables, or PEs) are sorted into two classes: benign and malicious.
  9. MalwareScore: Static features. Deployed to customer machines. Available at VirusTotal: https://www.virustotal.com/
  10. Why a Model Bakeoff? Decide on an approach.
  11. Why a Model Bakeoff? Build institutional knowledge of diverse models.
  12. Why NOT a Model Bakeoff? Incomplete exploration of the model design space. (Mike Bostock. Visualizing Algorithms. https://bost.ocks.org/mike/algorithms/)
  13. Why NOT a Model Bakeoff? Setting up the bakeoff is most of the work. [Chart: "Highly Scientific and Definitely Not Made Up Data About How Hard Tasks Are"]
  14. Why NOT a Bakeoff? Optimizing the model is squeezing out the last bits of performance. ("Blue is so much better!!!!")
  15. Setup
  16. Data: MalwareScore is trained on 6M benign and 3M malicious files. The bakeoff was carried out on an early subset of those. [Chart: top 10 malicious families]
  17. Features: Raw-data-based features (byte histogram, sliding window byte entropy) and PE-header-based features (PE information, imports, exports, sections, version information).
  18. Features: Byte histogram and sliding window byte entropy. [Figure: raw bytes of a PE file, beginning with the "MZ" magic number and the "This program cannot be run in DOS mode" stub]
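The two raw-data feature groups above can be sketched in plain Python. This is a minimal illustration, not the production MalwareScore extractor; the 1024-byte window size is an assumption.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Count occurrences of each possible byte value (0-255)."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def sliding_window_entropy(data: bytes, window: int = 1024) -> list:
    """Shannon entropy (bits per byte) of each fixed-size window of the file."""
    entropies = []
    for i in range(0, len(data), window):
        chunk = data[i:i + window]
        counts = Counter(chunk)
        ent = -sum((c / len(chunk)) * math.log2(c / len(chunk))
                   for c in counts.values())
        entropies.append(ent)
    return entropies
```

Packed or encrypted sections show up as high-entropy windows (approaching 8 bits per byte), which is one reason entropy is a useful static feature for malware classification.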
  19. Features: PE information, PE imports, PE exports, PE sections, PE version information.
  20. Features: Feature hashing is applied to each of these feature groups (PE information, imports, exports, sections, version information).
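Feature hashing (the "hashing trick") maps variable-length sets of strings, such as imported function names, into fixed-length vectors. A minimal sketch in plain Python; the bucket count and token format are illustrative assumptions, not the actual MalwareScore settings:

```python
import hashlib

def hash_features(tokens, n_buckets=256):
    """Map an arbitrary list of string tokens to a fixed-size count vector."""
    vec = [0] * n_buckets
    for tok in tokens:
        # Stable hash -> bucket index; collisions are tolerated by design.
        idx = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

# e.g. hashing a PE's import table into a fixed-width feature vector
imports = ['KERNEL32.dll:CreateFileW', 'KERNEL32.dll:WriteFile',
           'ADVAPI32.dll:RegOpenKeyExW']
features = hash_features(imports)
```

scikit-learn's `FeatureHasher` implements the same idea, with a signed hash to reduce collision bias.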
  21. Models: Model architectures were chosen by the team member most knowledgeable about each. A small grid search was carried out to find the best parameters.
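The code slides that follow all call a shared `bake` helper, whose implementation isn't shown in the deck. A plausible reconstruction using scikit-learn's `GridSearchCV` might look like this; the scoring metric, fold count, and function body are assumptions, not the author's actual code:

```python
from sklearn.model_selection import GridSearchCV

def bake(estimator, tuned_parameters, X, y):
    """Grid-search the given parameter grid and return the best fitted model."""
    search = GridSearchCV(estimator, tuned_parameters,
                          scoring='roc_auc',  # the bakeoff's performance metric
                          cv=3, n_jobs=-1)
    search.fit(X, y)
    return search.best_estimator_
```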
  22. Metrics to Judge Models: Performance, model size, query execution time.
  23. Metrics: Performance is measured as ROC AUC (area under the receiver operating characteristic curve). http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
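For reference, ROC AUC is computed from classifier scores and true labels; scikit-learn makes it a one-liner (toy numbers below, not bakeoff data):

```python
from sklearn.metrics import roc_auc_score

# Classifier scores for four samples and their true labels (toy data)
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_scores)  # 0.75: one of four pos/neg pairs mis-ranked
```

An AUC of 0.5 corresponds to random ranking; 1.0 means the classifier separates the classes perfectly at some threshold.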
  24. Metrics: Model size gauges the feasibility of deploying a model to a user's computer. Query time gauges the feasibility of evaluating PE files as they are written to disk or as users attempt to run them.
  25. Contenders
  26. Resources: http://www-bcf.usc.edu/~gareth/ISL/ and http://scikit-learn.org/stable/documentation.html
  27. Nearest Neighbor (Introduction to Statistical Learning, page 40):

      import pickle
      from sklearn.neighbors import KNeighborsClassifier

      tuned_parameters = {'n_neighbors': [1, 5, 11],
                          'weights': ['uniform', 'distance']}
      model = bake(KNeighborsClassifier(algorithm='ball_tree'),
                   tuned_parameters, X[ix], y[ix])
      pickle.dump(model, open('Bakeoff_kNN.pkl', 'wb'))  # binary mode for pickle
  28. Logistic Regression (Introduction to Statistical Learning, page 131):

      import pickle
      from sklearn.linear_model import SGDClassifier

      tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3],
                          'l1_ratio': [0.0, 0.15, 0.85, 1.0]}
      model = bake(SGDClassifier(loss='log', penalty='elasticnet'),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_logisticRegression.pkl', 'wb'))
  29. Support Vector Machine (Introduction to Statistical Learning, page 342). A linear SVM, trained by SGD with hinge loss:

      import pickle
      from sklearn.linear_model import SGDClassifier

      tuned_parameters = {'alpha': [1e-5, 1e-4, 1e-3],
                          'l1_ratio': [0.0, 0.15, 0.85, 1.0]}
      model = bake(SGDClassifier(loss='hinge', penalty='elasticnet'),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_SVM.pkl', 'wb'))
  30. Naïve Bayes (http://scikit-learn.org/stable/modules/naive_bayes.html):

      import pickle
      from sklearn.naive_bayes import GaussianNB

      tuned_parameters = {}  # GaussianNB has no hyperparameters to tune
      model = bake(GaussianNB(), tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_NB.pkl', 'wb'))
  31. Random Forest (Introduction to Statistical Learning, page 308):

      import pickle
      from sklearn.ensemble import RandomForestClassifier

      tuned_parameters = {'n_estimators': [20, 50, 100],
                          'min_samples_split': [2, 5, 10],
                          'max_features': ['sqrt', 0.1, 0.2]}
      model = bake(RandomForestClassifier(oob_score=False),
                   tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_randomforest.pkl', 'wb'))
  32. Gradient Boosted Decision Trees (http://scikit-learn.org/stable/auto_examples/index.html):

      import pickle
      from xgboost import XGBClassifier

      tuned_parameters = {'max_depth': [3, 4, 5],
                          'n_estimators': [20, 50, 100],
                          'colsample_bytree': [0.9, 1.0]}
      model = bake(XGBClassifier(), tuned_parameters, X, y)
      pickle.dump(model, open('Bakeoff_xgboost.pkl', 'wb'))
  33. Deep Learning: Features are fed to a multilayer perceptron with three hidden layers using dropout and batch normalization. http://deeplearning.net/tutorial/mlp.html
  34. The Keras model (input_dropout, hidden_dropout, and n_units are hyperparameters defined elsewhere):

      from keras.models import Sequential
      from keras.layers.core import Dense, Dropout
      from keras.layers import PReLU, BatchNormalization

      model = Sequential()
      model.add(Dropout(input_dropout, input_shape=(n_units,)))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(n_units))
      model.add(PReLU())
      model.add(Dropout(hidden_dropout))
      model.add(BatchNormalization())
      model.add(Dense(1, activation='sigmoid'))
      model.compile(loss='binary_crossentropy', optimizer='adam',
                    metrics=['accuracy'])
  35. Results
  36. Results [chart]
  37. Performance: ROC AUC [chart]
  38. Model Size [chart]
  39. Model Size [chart]
  40. Query Rate [chart]
  41. Takeaways
  42. Takeaways: In effect, we set up a Kaggle-style competition, and xgboost tends to win these competitions. A proposed answer for why: Newton boosting instead of MART (multiple additive regression trees). Available at: http://hdl.handle.net/11250/2433761
  43. But deep learning… We haven't yet traveled far enough down the deep learning path. The equivalent of this does not yet exist for PE files: [Figure: GoogLeNet architecture from the Inception paper. Available at: https://arxiv.org/abs/1409.4842]
  44. But deep learning… The most useful features are transparently available. Decision trees set a very high bar for deep learning to clear.
  45. But deep learning… At Endgame, we do believe there is discriminating power in the section data itself. We are researching the best deep learning architectures and the best way to combine the results with PE header features.
  46. Our Conclusions: Let's run GBDT on the endpoint! Let's provide a larger model in the cloud! Deep learning deserves more research!
  47. Further Reading:
      Examining Malware with Python: https://www.endgame.com/blog/technical-blog/examining-malware-python
      Machine Learning: https://www.endgame.com/blog/technical-blog/machine-learning-you-gotta-tame-beast-you-let-it-out-its-cage
      It's a Bake-off! https://www.endgame.com/blog/technical-blog/its-bake-navigating-evolving-world-machine-learning-models
  48. THANK YOU. proth@endgame.com @mrphilroth
