ember
an open source
malware classifier and
dataset
whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Learned ML at IceCube
Applying it at Endgame
whoami
Hyrum Anderson
Technical Director of Data Science
@drhyrum
Open datasets push ML research
forward
source: https://twitter.com/benhamner/status/938123380074610688
Datasets cited in NIPS papers over time
One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k
training/test split) images of
handwritten digits
“MNIST is the new unit test” –Ian
Goodfellow
Even when the dataset can no
longer effectively measure
performance improvements, it’s
still useful as a sanity check.
Another example: CIFAR 10/100
CIFAR-10:
Database of 60k (50k/10k training/test
split) images of 10 different classes
CIFAR-100:
60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
Security lacks these datasets
2014 Corporate Blog
2015 RSA FloorTalk
Reasons security lacks these
datasets
Personally identifiable information
Communicating vulnerabilities to attackers
Intellectual property
Existing Security Datasets
Mike Sconzo’s SecRepo: http://www.secrepo.com/
DGA Detection
Domain generation algorithms create large numbers of domain names to serve as
rendezvous for C&C servers.
Datasets available:
Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this
turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!
Static Classification of Malware
Basically the antivirus problem solved with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
https://www.kaggle.com/c/malware-classification
Static Classification of Malware
Benign and malicious samples can
be distributed in a feature space
(using attributes like file size and
number of imports)
Goal is to predict samples that we
haven’t seen yet
Static Classification of Malware
A YARA rule can divide these two
classes. But a simple rule won’t
generalize.
Static Classification of Malware
A machine learning model can
define a better boundary that
makes more accurate predictions
There are so many options for
machine learning algorithms. How
do we know which one is best?
Endgame Malware BEnchmark for Research
“MNIST for malware”
ember
“I know... But, if I tried to avoid
the name of every Javascript
framework, there wouldn’t be
any names left.”
Endgame Malware BEnchmark for Research
An open source collection of 1.1 million PE file SHA-256 hashes that were
scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model
trained on those features, and accompanying code.
It does NOT include the files themselves.
ember
The dataset is divided into a 900k training set and a
200k testing set
Training set includes 300k each of benign, malicious, and
unlabeled samples
data
Training set data appears
chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance
degradation over time
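A chronological split can be sketched in a few lines. This is a minimal illustration, assuming each record carries a "YYYY-MM" string in an `appeared` field; the records, the `chronological_split` helper, and the cutoff month are all made up for the example and are not part of the ember codebase.

```python
import json

# Hypothetical records mimicking the dataset's metadata keys
# (sha256, appeared, label); the values are invented for illustration.
lines = [
    '{"sha256": "aa", "appeared": "2017-03", "label": 0}',
    '{"sha256": "bb", "appeared": "2017-11", "label": 1}',
    '{"sha256": "cc", "appeared": "2017-01", "label": -1}',
    '{"sha256": "dd", "appeared": "2017-12", "label": 0}',
]


def chronological_split(records, cutoff):
    """Split records into (before, on_or_after) by their 'appeared' month.

    'YYYY-MM' strings sort chronologically, so plain string comparison works.
    """
    before = [r for r in records if r["appeared"] < cutoff]
    after = [r for r in records if r["appeared"] >= cutoff]
    return before, after


records = [json.loads(line) for line in lines]
# The cutoff is an arbitrary example value, not the dataset's actual boundary.
train, test = chronological_split(records, cutoff="2017-11")
print([r["sha256"] for r in train])
print([r["sha256"] for r in test])
```

Repeating the split with a sliding cutoff gives one way to do chronological cross validation: train on everything before month N, evaluate on month N, then advance N.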
data
7 JSON Lines files containing extracted features
data
[proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
-rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
[proth@proth-mbp data]$ cd ember
[proth@proth-mbp ember]$ ls -lh
total 9.2G
-rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
-rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
The first three keys of each line are metadata
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
"sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
"appeared": "2006-12",
"label": 0,
The rest of the keys are feature categories
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
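The same separation of metadata from feature categories can be done in Python. This is a rough sketch: the `split_record` helper is hypothetical, and an in-memory stand-in record replaces the real file so the example runs without the dataset.

```python
import io
import json

# The three metadata keys shown in the jq output above.
METADATA_KEYS = ("sha256", "appeared", "label")


def split_record(line):
    """Parse one JSON Lines record into (metadata, feature_categories)."""
    record = json.loads(line)
    metadata = {key: record.pop(key) for key in METADATA_KEYS}
    return metadata, record


# Stand-in for open("train_features_0.jsonl"); the feature values are
# placeholders, not real extracted features.
fake_file = io.StringIO(
    json.dumps(
        {
            "sha256": "0abb",
            "appeared": "2006-12",
            "label": 0,
            "general": {},
            "strings": {},
            "histogram": [],
        }
    )
    + "\n"
)

metadata, features = split_record(fake_file.readline())
print(metadata)
print(sorted(features))
```

Reading line by line keeps memory use flat, which matters when each file is over a gigabyte.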
features
Two kinds of features:
Calculated from raw bytes
Calculated from LIEF's parsing of
the PE file format
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF
features
Raw features are calculated from
the bytes and the lief object
Vectorized features are calculated
from the raw features
features
• Byte Histogram (histogram)
A simple counting of how many times each byte occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
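To make the two byte-level features concrete, here is a simplified sketch of a byte histogram and a sliding-window entropy calculation. It is not the ember implementation: the window and step sizes are illustrative defaults, and the full byte entropy histogram described by Saxe and Berlin additionally bins (entropy, byte value) pairs into a 2D histogram, which is omitted here.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how many times each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def windowed_entropy(data, window=1024, step=256):
    """Entropy of each sliding window; window and step are example values."""
    last_start = max(len(data) - window, 0)
    return [
        shannon_entropy(data[start:start + window])
        for start in range(0, last_start + 1, step)
    ]


sample = b"aab"
hist = byte_histogram(sample)
print(hist[ord("a")], hist[ord("b")])
```

Packed or encrypted regions tend toward 8 bits per byte, while padding and plain text score much lower, which is why the windowed view carries information that the global histogram does not.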
features
• Section Information (section)
Entry section and a list of all sections with name, size, entropy, and other information
given for each
features
• Import Information (imports)
Each library imported from along with imported function names
• Export Information (exports)
Exported function names
features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that
match various patterns like URLs, MZ header, or registry keys
features
• General Information (general)
Number of imports, exports, symbols and whether the file has relocations,
resources, or a signature
features
• Header Information (header)
Details about the target machine the file was compiled for, plus versions of linkers,
images, and operating system, etc.
vectorization
After downloading the dataset, feature vectorization is a necessary
step before model training
The ember codebase defines how each feature is hashed into a
vector using scikit-learn tools (the FeatureHasher class)
Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
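The idea behind feature hashing can be shown without any dependencies. The sketch below is a simplified stand-in for the hashing trick, not scikit-learn's FeatureHasher itself (which uses a different hash function); the `hash_features` helper, the vector width, and the "library:function" token format are assumptions chosen for illustration.

```python
import hashlib


def hash_features(tokens, n_features=8):
    """Map a variable-length list of strings to a fixed-length vector.

    Each token is hashed; the hash picks a bucket and a sign, so the
    output width stays constant no matter how many distinct tokens exist.
    """
    vector = [0.0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "little")
        index = value % n_features                   # which bucket to update
        sign = 1.0 if (value >> 63) == 0 else -1.0   # signed to reduce collision bias
        vector[index] += sign
    return vector


# Hypothetical import tokens in a "library:function" form.
imports = [
    "kernel32.dll:CreateFileA",
    "kernel32.dll:WriteFile",
    "user32.dll:MessageBoxA",
]
vec = hash_features(imports)
print(len(vec))
```

The payoff is that files with ten imports and files with ten thousand both become vectors of the same width, which is what a model such as LightGBM needs as input.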
model
Gradient Boosted Decision Tree model trained with
LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook
Pro i7
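import ember
import lightgbm as lgb

data_dir = "ember"  # example path to the directory with the vectorized features
X_train, y_train = ember.read_vectorized_features(data_dir, subset="train")
train_rows = (y_train != -1)  # keep labeled samples only; -1 marks unlabeled
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)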
model
Ember Model Performance:
ROC AUC: 0.9991123269999999
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
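The rates above all follow from choosing a score threshold. The sketch below shows how they relate, using invented scores and labels; the `rates_at_threshold` helper is hypothetical, and treating a score at or above the threshold as "malicious" is an assumption about the convention.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate, detection_rate).

    labels: 0 = benign, 1 = malicious. A sample is flagged as malicious
    when its score is at or above the threshold (assumed convention).
    """
    pairs = list(zip(scores, labels))
    false_pos = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    false_neg = sum(1 for s, y in pairs if y == 1 and s < threshold)
    negatives = sum(1 for _, y in pairs if y == 0)
    positives = sum(1 for _, y in pairs if y == 1)
    fpr = false_pos / negatives
    fnr = false_neg / positives
    return fpr, fnr, 1.0 - fnr  # detection rate is the complement of FNR


# Made-up model scores: three benign samples followed by three malicious ones.
scores = [0.10, 0.40, 0.90, 0.95, 0.30, 0.80]
labels = [0, 0, 0, 1, 1, 1]
fpr, fnr, detection = rates_at_threshold(scores, labels, threshold=0.871)
print(fpr, fnr, detection)
```

Raising the threshold trades false positives for false negatives, so a threshold is usually picked to hit a target false positive rate and the detection rate is reported at that point.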
disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)
code
https://github.com/endgameinc/ember
The ember repo makes
it easy to:
• Vectorize features
• Train the model
• Make predictions on
new PE files
notebook
The Jupyter notebook will
reproduce the graphics from
this talk from the extracted
dataset
suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search
Incorporate information from unlabeled samples into training
suggestions
To further research in the field of ML for static malware
detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network
based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass
ember model classification
demo time!
ember
Highlight: “Evidently, despite increased model size and computational
burden, featureless deep learning models have yet to eclipse the
performance of models that leverage domain knowledge via parsed
features.”
Read the paper:
https://arxiv.org/abs/1804.04637
ember
Download the data:
https://pubdata.endgame.com/ember/ember_dataset.tar.bz2
Download the code:
https://github.com/endgameinc/ember
THANK YOU!
Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum

More Related Content

What's hot

Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANShyam Krishna Khadka
 
Image forgery detection using error level analysis and deep learning
Image forgery detection using error level analysis and deep learningImage forgery detection using error level analysis and deep learning
Image forgery detection using error level analysis and deep learningTELKOMNIKA JOURNAL
 
AGE AND GENDER DETECTION.pptx
AGE AND GENDER DETECTION.pptxAGE AND GENDER DETECTION.pptx
AGE AND GENDER DETECTION.pptxssuserb4a9ba
 
Image Processing with OpenCV
Image Processing with OpenCVImage Processing with OpenCV
Image Processing with OpenCVdebayanin
 
Face recognization using artificial nerual network
Face recognization using artificial nerual networkFace recognization using artificial nerual network
Face recognization using artificial nerual networkDharmesh Tank
 
Moving object detection in video surveillance
Moving object detection in video surveillanceMoving object detection in video surveillance
Moving object detection in video surveillanceAshfaqul Haque John
 
Object detection - RCNNs vs Retinanet
Object detection - RCNNs vs RetinanetObject detection - RCNNs vs Retinanet
Object detection - RCNNs vs RetinanetRishabh Indoria
 
Detecting malaria using a deep convolutional neural network
Detecting malaria using a deep  convolutional neural networkDetecting malaria using a deep  convolutional neural network
Detecting malaria using a deep convolutional neural networkYusuf Brima
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSpotle.ai
 
Face Detection techniques
Face Detection techniquesFace Detection techniques
Face Detection techniquesAbhineet Bhamra
 
Object detection and Instance Segmentation
Object detection and Instance SegmentationObject detection and Instance Segmentation
Object detection and Instance SegmentationHichem Felouat
 
Lecture 4 neural networks
Lecture 4 neural networksLecture 4 neural networks
Lecture 4 neural networksParveenMalik18
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lectureazuring
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Image analysis using python
Image analysis using pythonImage analysis using python
Image analysis using pythonJerlyn Manohar
 
Object Detection Classification, tracking and Counting
Object Detection Classification, tracking and CountingObject Detection Classification, tracking and Counting
Object Detection Classification, tracking and CountingShounak Mitra
 

What's hot (20)

Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGAN
 
Image forgery detection using error level analysis and deep learning
Image forgery detection using error level analysis and deep learningImage forgery detection using error level analysis and deep learning
Image forgery detection using error level analysis and deep learning
 
AGE AND GENDER DETECTION.pptx
AGE AND GENDER DETECTION.pptxAGE AND GENDER DETECTION.pptx
AGE AND GENDER DETECTION.pptx
 
Age and Gender Detection.docx
Age and Gender Detection.docxAge and Gender Detection.docx
Age and Gender Detection.docx
 
Image Processing with OpenCV
Image Processing with OpenCVImage Processing with OpenCV
Image Processing with OpenCV
 
Face recognization using artificial nerual network
Face recognization using artificial nerual networkFace recognization using artificial nerual network
Face recognization using artificial nerual network
 
svm classification
svm classificationsvm classification
svm classification
 
Moving object detection in video surveillance
Moving object detection in video surveillanceMoving object detection in video surveillance
Moving object detection in video surveillance
 
Transfer Learning
Transfer LearningTransfer Learning
Transfer Learning
 
Object detection - RCNNs vs Retinanet
Object detection - RCNNs vs RetinanetObject detection - RCNNs vs Retinanet
Object detection - RCNNs vs Retinanet
 
Detecting malaria using a deep convolutional neural network
Detecting malaria using a deep  convolutional neural networkDetecting malaria using a deep  convolutional neural network
Detecting malaria using a deep convolutional neural network
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine Learning
 
Face Detection techniques
Face Detection techniquesFace Detection techniques
Face Detection techniques
 
Object detection and Instance Segmentation
Object detection and Instance SegmentationObject detection and Instance Segmentation
Object detection and Instance Segmentation
 
CIFAR-10
CIFAR-10CIFAR-10
CIFAR-10
 
Lecture 4 neural networks
Lecture 4 neural networksLecture 4 neural networks
Lecture 4 neural networks
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lecture
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Image analysis using python
Image analysis using pythonImage analysis using python
Image analysis using python
 
Object Detection Classification, tracking and Counting
Object Detection Classification, tracking and CountingObject Detection Classification, tracking and Counting
Object Detection Classification, tracking and Counting
 

Similar to Ember

PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019Masashi Shibata
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoffmrphilroth
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challengesMarc Borowczak
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profilerIhor Bobak
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoDatabricks
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...Amazon Web Services Korea
 
Automate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAutomate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAxel de Romblay
 

Similar to Ember (20)

PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
MLBox
MLBoxMLBox
MLBox
 
MLBox 0.8.2
MLBox 0.8.2 MLBox 0.8.2
MLBox 0.8.2
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoff
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
 
Automate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAutomate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBox
 

Recently uploaded

Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdf
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdfHCI Lesson 1 - Introduction to Human-Computer Interaction.pdf
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdfROWELL MARQUINA
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
A PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxA PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxatharvdev2010
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceOpsTree solutions
 
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...BookNet Canada
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxKunal Gupta
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024BookNet Canada
 
Women in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automationWomen in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automationDianaGray10
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Recently uploaded (20)

Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdf
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdfHCI Lesson 1 - Introduction to Human-Computer Interaction.pdf
HCI Lesson 1 - Introduction to Human-Computer Interaction.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
A PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxA PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptx
 

Ember

  • 1. ember an open source malware classifier and dataset
  • 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  • 4. Open datasets push ML research forward source: https://twitter.com/benhamner/status/938123380074610688 Datasets cited in NIPS papers over time
  • 5. One example: MNIST MNIST: http://yann.lecun.com/exdb/mnist/ Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  • 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  • 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  • 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  • 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous for C&C servers. Datasets available: Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/ DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
  • 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: https://www.ll.mit.edu//ideval/data/1998data.html https://www.ll.mit.edu//ideval/data/1999data.html KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html OLD!!!!
  • 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/ VirusShare [Malicious Only]: https://virusshare.com/ Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
  • 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports). The goal is to predict samples that we haven’t seen yet.
  • 14. Static Classification of Malware A YARA rule can divide these two classes. But a simple rule won’t be generalizable.
  • 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions. There are so many options for machine learning algorithms. How do we know which one is best?
  • 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  • 17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
  • 18. Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE File sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves. ember
  • 19. The dataset is divided into a 900k training set and a 200k testing set. The training set includes 300k each of benign, malicious, and unlabeled samples. data
  • 20. The training set data appears chronologically prior to the test data. Date metadata allows: chronological cross validation; quantifying model performance degradation over time. data
  • 21. 7 JSON Lines files containing extracted features data
      [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
      -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
      [proth@proth-mbp data]$ cd ember
      [proth@proth-mbp ember]$ ls -lh
      total 9.2G
      -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
      -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
      -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
      -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
      -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  • 22. The first three keys of each line are metadata data
      [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
      {
        "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
        "appeared": "2006-12",
        "label": 0,
  • 23. The rest of the keys are feature categories data
      [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
      [ "byteentropy", "exports", "general", "header", "histogram", "imports", "section", "strings" ]
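The record layout and date metadata described in the data slides above can be sketched as follows. This is a minimal illustration, assuming each line is a JSON object carrying the `sha256`, `appeared`, and `label` keys shown above. The two sample records are invented and heavily truncated, the helper names `split_record` and `chronological_split` are hypothetical, and the cutoff month is an arbitrary choice for the example rather than the dataset's actual train/test boundary.

```python
import json

# The three metadata keys shown on the slide; every other key is a feature category.
METADATA_KEYS = ("sha256", "appeared", "label")


def split_record(line):
    """Parse one JSON line into (metadata, feature_categories)."""
    record = json.loads(line)
    metadata = {key: record[key] for key in METADATA_KEYS}
    features = {key: value for key, value in record.items()
                if key not in METADATA_KEYS}
    return metadata, features


def chronological_split(records, cutoff):
    """Split records on the 'appeared' month (YYYY-MM strings sort chronologically)."""
    earlier = [r for r in records if r["appeared"] < cutoff]
    later = [r for r in records if r["appeared"] >= cutoff]
    return earlier, later


# Hypothetical records for illustration only; real lines are far larger.
sample_lines = [
    '{"sha256": "aa", "appeared": "2017-03", "label": 0, "general": {}, "strings": {}}',
    '{"sha256": "bb", "appeared": "2017-11", "label": 1, "general": {}, "strings": {}}',
]

parsed = [split_record(line) for line in sample_lines]
metadata_only = [meta for meta, _ in parsed]
earlier, later = chronological_split(metadata_only, cutoff="2017-11")

print(sorted(parsed[0][1]))      # feature category names of the first record
print(len(earlier), len(later))  # how many records fall on each side of the cutoff
```

Sliding the cutoff forward one month at a time would give one possible form of the chronological cross validation mentioned on slide 20.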
  • 24. features Two kinds of features: Calculated from raw bytes Calculated from lief parsing the PE file format https://lief.quarkslab.com/ https://lief.quarkslab.com/doc/Intro.html https://github.com/lief-project/LIEF
  • 25. features Raw features are calculated from the bytes and the lief object Vectorized features are calculated from the raw features
  • 26. features • Byte Histogram (histogram) A simple counting of how many times each byte occurs • Byte Entropy Histogram (byteentropy) Sliding window entropy calculation Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
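The two byte-level features above can be sketched in plain Python. This is a simplified illustration, not the ember implementation: the window and step sizes are assumed values, the function names are hypothetical, and the final step of binning the per-window entropies into a histogram (described in the cited paper) is left out.

```python
import math


def byte_histogram(data):
    """Count how many times each of the 256 possible byte values occurs."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    return counts


def shannon_entropy(data):
    """Shannon entropy of a byte string in bits per byte (between 0.0 and 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in byte_histogram(data):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def sliding_entropy(data, window=2048, step=1024):
    """Entropy of each window; input no longer than one window is one window.

    The window and step sizes are assumptions made for this sketch.
    """
    if len(data) <= window:
        return [shannon_entropy(data)] if data else []
    return [shannon_entropy(data[start:start + window])
            for start in range(0, len(data) - window + 1, step)]


# Constant bytes carry no information; every byte value once is maximally mixed.
print(shannon_entropy(bytes(100)))
print(shannon_entropy(bytes(range(256))))
print(len(sliding_entropy(bytes(4096))))
```

Packed or encrypted regions of a file tend to show up as high-entropy windows, which is why a per-window view adds information that a single whole-file histogram does not.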
  • 27. features • Section Information (section) Entry section and a list of all sections with name, size, entropy, and other information given for each
  • 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  • 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
  • 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  • 31. features • Header Information (header) Details about the machine the file was compiled on. Versions of linkers, images, operating system, etc.
  • 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training. The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (the FeatureHasher class). Feature vectorizing took 20 hours on my 2015 MacBook Pro i7.
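The idea behind hashing a variable-length feature (such as a list of imported function names) into a fixed-length vector can be sketched without any dependencies. This is a stand-in for illustration only: it uses SHA-256 from the standard library, whereas scikit-learn's FeatureHasher uses MurmurHash3, so the resulting vectors will not match ember's. The function name `hash_features`, the vector length, and the sample tokens are all hypothetical.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of string tokens to a fixed-length vector.

    Each token is hashed to pick a bucket and a sign. Signed updates let
    collisions partly cancel instead of always accumulating.
    """
    vector = [0.0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        index = value % n_features                   # bucket from the low bits
        sign = 1.0 if (value >> 63) == 0 else -1.0   # sign from the top bit
        vector[index] += sign
    return vector


# Hypothetical import names; any number of tokens yields the same vector length.
few = hash_features(["kernel32.dll:CreateFileA"])
many = hash_features(["kernel32.dll:CreateFileA", "user32.dll:MessageBoxA",
                      "advapi32.dll:RegOpenKeyExA"])
print(len(few), len(many))
```

The design trade-off is that no vocabulary has to be stored or learned, at the cost of occasional collisions between unrelated tokens.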
  • 33. model Gradient Boosted Decision Tree model trained with LightGBM on labeled samples. Model training took 3 hours on my 2015 MacBook Pro i7.
      import lightgbm as lgb
      from ember import read_vectorized_features  # helper from the ember repo
      # data_dir: directory holding the vectorized ember features
      X_train, y_train = read_vectorized_features(data_dir, subset="train")
      train_rows = (y_train != -1)  # keep labeled samples only
      lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
      lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  • 34. model Ember Model Performance: ROC AUC: 0.9991123269999999 Threshold: 0.871 False Positive Rate: 0.099% False Negative Rate: 7.009% Detection Rate: 92.991%
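The rates on this slide all follow from choosing a score threshold. Here is a minimal sketch of that calculation, assuming a sample is flagged as malicious when its score is at or above the threshold and that labels are 0 for benign and 1 for malicious. The scores and labels below are invented toy values, not ember results, and `rates_at_threshold` is a hypothetical helper name.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate, detection_rate).

    Assumes both classes are present and that score >= threshold means
    "flagged as malicious".
    """
    pairs = list(zip(scores, labels))
    n_benign = sum(1 for _, y in pairs if y == 0)
    n_malicious = sum(1 for _, y in pairs if y == 1)
    false_pos = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    false_neg = sum(1 for s, y in pairs if y == 1 and s < threshold)
    fpr = false_pos / n_benign
    fnr = false_neg / n_malicious
    return fpr, fnr, 1.0 - fnr  # detection rate is the complement of the FNR


# Toy values for illustration only.
scores = [0.10, 0.40, 0.95, 0.30, 0.90, 0.99]
labels = [0, 0, 0, 1, 1, 1]
fpr, fnr, detection = rates_at_threshold(scores, labels, threshold=0.871)
print(fpr, fnr, detection)
```

Raising the threshold lowers the false positive rate and raises the false negative rate, which is why the slide's detection rate and false negative rate sum to 100%.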
  • 35. disclaimer This model is NOT MalwareScore. MalwareScore: is better optimized; has better features; performs better; is constantly updated with new data; is the best option for protecting your endpoints (in my totally biased opinion).
  • 36. code https://github.com/endgameinc/ember The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
  • 37. notebook The Jupyter notebook will reproduce the graphics from this talk from the extracted dataset
  • 38. suggestions To beat the benchmark model performance: use feature selection techniques to eliminate misleading features; do feature engineering to find better features; optimize LightGBM model parameters with grid search; incorporate information from unlabeled samples into training.
  • 39. suggestions To further research in the field of ML for static malware detection: quantify model performance degradation through time; build and compare the performance of featureless neural network based models (need independent access to samples); an adversarial network could create or modify PE files to bypass ember model classification.
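One way to start on the first suggestion, quantifying degradation through time, is to group malicious test samples by their `appeared` month and compute the detection rate for each month. This is only a sketch under stated assumptions: the records are invented toy tuples, the threshold is arbitrary, and `detection_rate_by_month` is a hypothetical helper, not part of the ember codebase.

```python
from collections import defaultdict


def detection_rate_by_month(records, threshold):
    """Detection rate over malicious samples, grouped by 'appeared' month.

    records: iterable of (appeared, label, score) with appeared as 'YYYY-MM',
    label 1 for malicious, and score >= threshold meaning "detected".
    """
    detected = defaultdict(int)
    total = defaultdict(int)
    for appeared, label, score in records:
        if label != 1:
            continue  # only malicious samples count toward detection rate
        total[appeared] += 1
        if score >= threshold:
            detected[appeared] += 1
    return {month: detected[month] / total[month] for month in sorted(total)}


# Toy records for illustration only.
toy_records = [
    ("2017-11", 1, 0.98), ("2017-11", 1, 0.95), ("2017-11", 0, 0.02),
    ("2017-12", 1, 0.97), ("2017-12", 1, 0.40),
]
for month, rate in detection_rate_by_month(toy_records, threshold=0.9).items():
    print(month, rate)
```

A downward trend across later months would suggest the model is aging relative to newer samples.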
  • 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper: https://arxiv.org/abs/1804.04637
  • 43. ember Download the data: https://pubdata.endgame.com/ember/ember_dataset.tar.bz2 Download the code: https://github.com/endgameinc/ember THANK YOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum