An open source malware dataset and classifier

Published in: Technology


  1. 1. ember an open source malware classifier and dataset
  2. 2. whoami Phil Roth Data Scientist @mrphilroth Learned ML at IceCube Applying it at Endgame
  3. 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  4. 4. Open datasets push ML research forward source: Datasets cited in NIPS papers over time
  5. 5. One example: MNIST MNIST: Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  6. 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR:
  7. 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  8. 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  9. 9. Existing Security Datasets Sconzo’s
  10. 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous points for C&C servers. Datasets available: Alexa Top 1 Million: DGA Archive: DGA Domains: Johannes Bader's reversing:
  11. 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: KDD Cup 1999: OLD!!!!
  12. 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: VirusShare [Malicious Only]: Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
  13. 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports) Goal is to predict samples that we haven’t seen yet
  14. 14. Static Classification of Malware A YARA rule can divide these two classes. But a simple rule won’t generalize to samples we haven’t seen.
  15. 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions There are so many options for machine learning algorithms. How do we know which one is best?
  16. 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  17. 17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
  18. 18. Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE File sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves. ember
  19. 19. The dataset is divided into a 900k training set and a 200k testing set. The training set includes 300k each of benign, malicious, and unlabeled samples data
  20. 20. Training set data appears chronologically prior to the test data Date metadata allows: • Chronological cross validation • Quantifying model performance degradation over time train test data
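
A minimal sketch of chronological cross validation, assuming each record carries the `appeared` field in `YYYY-MM` form as shown in the metadata. The function name `chronological_folds`, the fold layout, and the toy records are hypothetical and not part of the ember codebase.

```python
def chronological_folds(records, n_folds):
    """Yield (train, validation) pairs in which no validation record
    appeared earlier than any training record."""
    # 'YYYY-MM' strings sort chronologically as plain strings.
    ordered = sorted(records, key=lambda r: r["appeared"])
    fold_size = len(ordered) // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train = ordered[: i * fold_size]
        valid = ordered[i * fold_size : (i + 1) * fold_size]
        yield train, valid


# Toy records standing in for real dataset lines (hypothetical values).
records = [
    {"sha256": f"hash{i}", "appeared": f"2017-{month:02d}", "label": i % 2}
    for i, month in enumerate([3, 1, 12, 7, 5, 2, 9, 11])
]

for train, valid in chronological_folds(records, n_folds=3):
    print(len(train), "train /", len(valid), "validation")
```

Scoring the same model on each successive validation window is one way to see how performance degrades as the data moves further from the training period.
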
  21. 21. 7 JSON line files containing extracted features data [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2 -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2 [proth@proth-mbp data]$ cd ember [proth@proth-mbp ember]$ ls -lh total 9.2G -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  22. 22. The first three keys of each line are metadata data [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4 { "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2", "appeared": "2006-12", "label": 0,
  23. 23. The rest of the keys are feature categories data [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys" [ "byteentropy", "exports", "general", "header", "histogram", "imports", "section", "strings" ]
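
The same split between metadata and feature categories can be sketched in Python. The key names come from the jq output above; the helper `split_record` and the empty placeholder feature values are assumptions for illustration only.

```python
import json

METADATA_KEYS = ("sha256", "appeared", "label")


def split_record(line):
    """Parse one JSON line and separate metadata from feature categories."""
    record = json.loads(line)
    metadata = {key: record[key] for key in METADATA_KEYS}
    features = {k: v for k, v in record.items() if k not in METADATA_KEYS}
    return metadata, features


# A stand-in line: real metadata from the slide, placeholder feature values.
sample_line = json.dumps({
    "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
    "appeared": "2006-12",
    "label": 0,
    "byteentropy": [], "exports": [], "general": {}, "header": {},
    "histogram": [], "imports": {}, "section": {}, "strings": {},
})

metadata, features = split_record(sample_line)
print(metadata["appeared"], sorted(features))
```
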
  24. 24. features Two kinds of features: Calculated from raw bytes Calculated from lief parsing the PE file format
  25. 25. features Raw features are calculated from the bytes and the lief object Vectorized features are calculated from the raw features
  26. 26. features • Byte Histogram (histogram) A simple counting of how many times each byte occurs • Byte Entropy Histogram (byteentropy) Sliding window entropy calculation Details in Section 2.1.1: [Saxe, Berlin 2015]
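
A rough sketch of the two raw-byte calculations using only the standard library. It shows the byte count and the sliding-window entropy step, not the full joint entropy/byte-value histogram described in [Saxe, Berlin 2015]. The window and step sizes here are assumptions, not values stated in the slides.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Count how many times each of the 256 byte values occurs."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    return counts


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte, between 0 and 8."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


def window_entropies(data: bytes, window: int = 2048, step: int = 1024) -> list:
    """Entropy of each sliding window (window and step sizes are assumed)."""
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]
```

Packed or encrypted regions tend to show up as windows with entropy close to 8 bits per byte, which is why the windowed view carries more information than a single whole-file entropy value.
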
  27. 27. features • Section Information (section) Entry section and a list of all sections with name, size, entropy, and other information given for each
  28. 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  29. 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
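
A hedged sketch of how string statistics like these could be computed. The minimum string length of 5, the regular expressions, and the output key names are all assumptions chosen for illustration; the ember code may define them differently.

```python
import re

# Runs of at least 5 printable ASCII characters (the length is an assumption).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")
URL_PATTERN = re.compile(rb"https?://", re.IGNORECASE)
REGISTRY_PATTERN = re.compile(rb"HKEY_")
MZ_PATTERN = re.compile(rb"MZ")


def string_features(data: bytes) -> dict:
    """Summarize printable strings and count a few suspicious patterns."""
    strings = PRINTABLE_RUN.findall(data)
    total_length = sum(len(s) for s in strings)
    return {
        "numstrings": len(strings),
        "avlength": total_length / len(strings) if strings else 0.0,
        "urls": len(URL_PATTERN.findall(data)),
        "registry": len(REGISTRY_PATTERN.findall(data)),
        "MZ": len(MZ_PATTERN.findall(data)),
    }


blob = b"MZ\x00\x01http://example.com/a\x00HKEY_LOCAL_MACHINE\\Software\x00abc\x00"
print(string_features(blob))
```
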
  30. 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  31. 31. features • Header Information (header) Details about the machine the file was compiled on, plus versions of linkers, images, and operating system, etc.
  32. 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training. The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (the FeatureHasher class). Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
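
To illustrate the idea behind feature hashing, here is a standard-library sketch of the hashing trick. It is a stand-in for scikit-learn's FeatureHasher, not its implementation: the hash function, the vector length of 16, and the `library:function` token format are assumptions.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map string tokens into a fixed-length vector with a signed hashing trick."""
    vector = [0.0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # The first 8 bytes pick the bucket; one bit of the next byte picks the sign.
        index = int.from_bytes(digest[:8], "little") % n_features
        sign = 1.0 if digest[8] & 1 else -1.0
        vector[index] += sign
    return vector


# Hypothetical import tokens for a single file.
imports = [
    "kernel32.dll:CreateFileA",
    "kernel32.dll:WriteFile",
    "user32.dll:MessageBoxA",
]
print(hash_features(imports))
```

The appeal of this approach is that variable-length lists, such as imported function names, always become vectors of the same size without building a vocabulary first. The cost is that distinct tokens can collide in the same bucket.
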
  33. 33. model Gradient Boosted Decision Tree model trained with LightGBM on labeled samples. Model training took 3 hours on my 2015 MacBook Pro i7 import lightgbm as lgb X_train, y_train = read_vectorized_features(data_dir, subset="train") train_rows = (y_train != -1) lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows]) lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  34. 34. model Ember Model Performance: ROC AUC: 0.9991123269999999 Threshold: 0.871 False Positive Rate: 0.099% False Negative Rate: 7.009% Detection Rate: 92.991%
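
The threshold and error rates above are related: a threshold is picked to hold the false positive rate near a target, and the detection rate is then one minus the false negative rate (92.991% = 100% - 7.009%). A minimal sketch of that calculation on made-up scores follows; the function names and toy data are hypothetical, and this is not the code that produced the numbers on the slide.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) for labels 0 and 1."""
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return false_pos / negatives, false_neg / positives


def threshold_for_fpr(scores, labels, max_fpr):
    """Lowest observed score that keeps the false positive rate <= max_fpr."""
    for candidate in sorted(set(scores)):
        fpr, _ = rates_at_threshold(scores, labels, candidate)
        if fpr <= max_fpr:
            return candidate
    return float("inf")  # no observed score meets the target


# Toy model scores: four benign samples (0) and four malicious samples (1).
scores = [0.05, 0.10, 0.20, 0.90, 0.30, 0.80, 0.95, 0.99]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = threshold_for_fpr(scores, labels, max_fpr=0.0)
fpr, fnr = rates_at_threshold(scores, labels, threshold)
print(f"threshold={threshold} FPR={fpr:.1%} FNR={fnr:.1%} detection={1 - fnr:.1%}")
```
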
  35. 35. disclaimer This model is NOT MalwareScore MalwareScore: is better optimized has better features performs better is constantly updated with new data is the best option for protecting your endpoints (in my totally biased opinion)
  36. 36. code The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
  37. 37. notebook The Jupyter notebook will reproduce the graphics from this talk from the extracted dataset
  38. 38. suggestions To beat the benchmark model performance: Use feature selection techniques to eliminate misleading features Do feature engineering to find better features Optimize LightGBM model parameters with grid search Incorporate information from unlabeled samples into training
  39. 39. suggestions To further research in the field of ML for static malware detection: Quantify model performance degradation through time Build and compare the performance of featureless neural network based models (need independent access to samples) An adversarial network could create or modify PE files to bypass ember model classification
  40. 40. demo time!
  41. 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper:
  42. 42. ember Download the data: Download the code: THANK YOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum