Ember

An open source malware dataset and classifier
  1. 1. ember an open source malware classifier and dataset
  2. 2. whoami Phil Roth Data Scientist @mrphilroth proth@endgame.com Learned ML at IceCube Applying it at Endgame
  3. 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  4. 4. Open datasets push ML research forward source: https://twitter.com/benhamner/status/938123380074610688 Datasets cited in NIPS papers over time
  5. 5. One example: MNIST MNIST: http://yann.lecun.com/exdb/mnist/ Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  6. 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  7. 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  8. 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  9. 9. Existing Security Datasets Mike Sconzo's http://www.secrepo.com/
  10. 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous points for C&C servers. Datasets available: Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/ DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
  11. 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: https://www.ll.mit.edu//ideval/data/1998data.html https://www.ll.mit.edu//ideval/data/1999data.html KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html OLD!!!!
  12. 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/ VirusShare [Malicious Only]: https://virusshare.com/ Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
  13. 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports) Goal is to predict samples that we haven’t seen yet
  14. 14. Static Classification of Malware A YARA rule can divide these two classes. But a simple rule won't be generalizable.
  15. 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions There are so many options for machine learning algorithms. How do we know which one is best?
  16. 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  17. 17. "I know... But, if I tried to avoid the name of every JavaScript framework, there wouldn't be any names left."
  18. 18. Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE File sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves. ember
  19. 19. data The dataset is divided into a 900k training set and a 200k testing set. Training set includes 300k each of benign, malicious, and unlabeled samples
  20. 20. Training set data appears chronologically prior to the test data Date metadata allows: • Chronological cross validation • Quantifying model performance degradation over time train test data
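For illustration, a minimal sketch of a chronological split keyed on the "appeared" month carried with each sample (strings like "2006-12" sort correctly as plain text); the cut-off date below is an arbitrary assumption, not the dataset's actual train/test boundary:

      def chronological_split(records, cutoff="2017-07"):
          # records: parsed JSON lines; "appeared" is a "YYYY-MM" string
          train = [r for r in records if r["appeared"] < cutoff]
          test = [r for r in records if r["appeared"] >= cutoff]
          return train, test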
  21. 21. data 7 JSON line files containing extracted features
      [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
      -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
      [proth@proth-mbp data]$ cd ember
      [proth@proth-mbp ember]$ ls -lh
      total 9.2G
      -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
      -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
      -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
      -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  22. 22. data The first three keys of each line are metadata
      [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
      {
        "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
        "appeared": "2006-12",
        "label": 0,
  23. 23. data The rest of the keys are feature categories
      [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
      [
        "byteentropy",
        "exports",
        "general",
        "header",
        "histogram",
        "imports",
        "section",
        "strings"
      ]
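The same inspection can be done in Python; a minimal sketch, assuming the dataset has been extracted into the current directory (the path is hypothetical):

      import json

      with open("train_features_0.jsonl") as f:
          record = json.loads(f.readline())

      metadata = {k: record[k] for k in ("sha256", "appeared", "label")}
      feature_groups = sorted(k for k in record if k not in metadata)
      print(metadata)         # sha256, appeared, label
      print(feature_groups)   # ['byteentropy', 'exports', 'general', 'header', ...]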
  24. 24. features Two kinds of features: Calculated from raw bytes Calculated from lief parsing the PE file format https://lief.quarkslab.com/ https://lief.quarkslab.com/doc/Intro.html https://github.com/lief-project/LIEF
  25. 25. features Raw features are calculated from the bytes and the lief object Vectorized features are calculated from the raw features
  26. 26. features • Byte Histogram (histogram) A simple counting of how many times each byte occurs • Byte Entropy Histogram (byteentropy) Sliding window entropy calculation Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
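A simplified sketch of these two byte-level features; this is not the exact ember implementation, and the window, step, and bin counts are assumptions (see Saxe & Berlin for the real byte entropy construction):

      import numpy as np

      def byte_histogram(data: bytes) -> np.ndarray:
          # Count how many times each of the 256 byte values occurs.
          return np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)

      def byte_entropy_histogram(data: bytes, window=2048, step=1024) -> np.ndarray:
          # Joint histogram of (window entropy, coarse byte value), 16x16 bins.
          arr = np.frombuffer(data, dtype=np.uint8)
          hist = np.zeros((16, 16))
          for start in range(0, max(len(arr) - window, 0) + 1, step):
              counts = np.bincount(arr[start:start + window] >> 4, minlength=16)
              if counts.sum() == 0:
                  continue
              p = counts / counts.sum()
              entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # 0..4 bits for 16 symbols
              hist[min(int(entropy * 4), 15)] += counts        # quantize entropy to 16 rows
          return hist.flatten()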
  27. 27. features • Section Information (section) Entry section and a list of all sections with name, size, entropy, and other information given for each
  28. 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  29. 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
  30. 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  31. 31. features • Header Information (header) Details about the machine the file was compiled on, and versions of linkers, images, operating system, etc.
  32. 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (FeatureHasher function) Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
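As a concrete illustration of the hashing step, a minimal sketch using scikit-learn's FeatureHasher on the imports group; the real tokenization and bucket counts are defined in the ember codebase, so the choices below are assumptions:

      from sklearn.feature_extraction import FeatureHasher

      raw_imports = {
          "KERNEL32.dll": ["CreateFileA", "ReadFile"],
          "ADVAPI32.dll": ["RegOpenKeyExA"],
      }

      # Flatten to "library:function" tokens, then hash into a fixed-length vector.
      tokens = [f"{lib.lower()}:{fn}" for lib, fns in raw_imports.items() for fn in fns]
      hasher = FeatureHasher(n_features=256, input_type="string")
      import_vector = hasher.transform([tokens]).toarray()[0]   # shape (256,)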
  33. 33. model Gradient Boosted Decision Tree model trained with LightGBM on labeled samples. Model training took 3 hours on my 2015 MacBook Pro i7
      import lightgbm as lgb
      X_train, y_train = read_vectorized_features(data_dir, subset="train")
      train_rows = (y_train != -1)
      lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
      lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  34. 34. model Ember Model Performance:
      ROC AUC: 0.9991123269999999
      Threshold: 0.871
      False Positive Rate: 0.099%
      False Negative Rate: 7.009%
      Detection Rate: 92.991%
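For reference, a sketch of how numbers like these can be recomputed from the test set, reusing read_vectorized_features, data_dir, and lgbm_model from the previous slide and assuming the helper also accepts subset="test"; the exact values depend on the trained model:

      import numpy as np
      from sklearn.metrics import roc_auc_score, roc_curve

      X_test, y_test = read_vectorized_features(data_dir, subset="test")
      y_scores = lgbm_model.predict(X_test)

      print("ROC AUC:", roc_auc_score(y_test, y_scores))
      fpr, tpr, thresholds = roc_curve(y_test, y_scores)
      idx = np.searchsorted(fpr, 0.001, side="right") - 1   # last operating point with FPR <= 0.1%
      print("Threshold:", thresholds[idx])
      print("False Positive Rate:", fpr[idx])
      print("False Negative Rate:", 1 - tpr[idx])
      print("Detection Rate:", tpr[idx])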
  35. 35. disclaimer This model is NOT MalwareScore MalwareScore: is better optimized has better features performs better is constantly updated with new data is the best option for protecting your endpoints (in my totally biased opinion)
  36. 36. code https://github.com/endgameinc/ember The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
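A sketch of that workflow with the ember package; the function names below are how I recall the repo's README and should be treated as assumptions (check the repo for the current API):

      import ember

      data_dir = "/data/ember"                     # wherever the dataset was extracted
      ember.create_vectorized_features(data_dir)   # raw JSON lines -> vectorized features
      lgbm_model = ember.train_model(data_dir)     # train the benchmark LightGBM model

      # Score a new PE file with the trained model.
      with open("putty.exe", "rb") as f:
          print(ember.predict_sample(lgbm_model, f.read()))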
  37. 37. notebook The Jupyter notebook will reproduce the graphics in this talk from the extracted dataset
  38. 38. suggestions To beat the benchmark model performance: Use feature selection techniques to eliminate misleading features Do feature engineering to find better features Optimize LightGBM model parameters with grid search Incorporate information from unlabeled samples into training
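For the grid-search suggestion, one possible starting point using scikit-learn's GridSearchCV with LightGBM's sklearn wrapper, reusing X_train, y_train, and train_rows from the model slide; the grid values are arbitrary illustrations, not tuned settings:

      import lightgbm as lgb
      from sklearn.model_selection import GridSearchCV

      param_grid = {
          "num_leaves": [31, 64, 128],
          "learning_rate": [0.05, 0.1],
          "n_estimators": [100, 400],
      }
      search = GridSearchCV(lgb.LGBMClassifier(), param_grid, scoring="roc_auc", cv=3)
      search.fit(X_train[train_rows], y_train[train_rows])
      print(search.best_params_)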
  39. 39. suggestions To further research in the field of ML for static malware detection: Quantify model performance degradation through time Build and compare the performance of featureless neural network based models (need independent access to samples) An adversarial network could create or modify PE files to bypass ember model classification
  40. 40. demo time!
  41. 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper: https://arxiv.org/abs/1804.04637
  42. 42. ember Download the data: https://pubdata.endgame.com/ember/ember_dataset.tar.bz2 Download the code: https://github.com/endgameinc/ember THANK YOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum
