SlideShare a Scribd company logo

Machine Learning Model Bakeoff

Phil Roth presented the results of a malware classification model bakeoff between several machine learning algorithms. The models evaluated were k-nearest neighbors, logistic regression, support vector machines, naive Bayes, random forests, gradient boosted decision trees, and deep learning. Based on the performance, size, and query time metrics, gradient boosted decision trees had the best overall results. However, the presenter noted that deep learning approaches deserve more research due to their potential to learn directly from file content. The conclusions were that gradient boosted decision trees could be deployed to endpoints while larger deep learning models could be used in the cloud after further development.

1 of 48
Download to read offline
MODEL BAKEOFF
WHICH MODEL CAME HOT AND FRESH OUT THE
KITCHEN IN OUR MALWARE CLASSIFIER
BAKEOFF?
Data Intelligence Conference 2017
Phil Roth
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
2
whoami
3
whoami
Hyrum Anderson
@drhyrum
Jonathan Woodbridge
@jswoodbridge
Bobby Filar
@filar
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Data Science in Security
4
Domain Generation Algorithm (DGA) Protection
Malicious File Classification
Anomaly Detection
Insider Threat Detection
Network Intrusion Detection System
doable
hardly doable
etc….
Mission
5
The best minds of my generation
are thinking about how to make
people click ads. That sucks.
- Jeff Hammerbacher
https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different
Relevant
6
https://arstechnica.com/ [various]

Recommended

Time Series Analysis for Network Secruity
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruitymrphilroth
 
JVM Mechanics: Understanding the JIT's Tricks
JVM Mechanics: Understanding the JIT's TricksJVM Mechanics: Understanding the JIT's Tricks
JVM Mechanics: Understanding the JIT's TricksDoug Hawkins
 
Java Performance Puzzlers
Java Performance PuzzlersJava Performance Puzzlers
Java Performance PuzzlersDoug Hawkins
 
Java_practical_handbook
Java_practical_handbookJava_practical_handbook
Java_practical_handbookManusha Dilan
 
The Ring programming language version 1.5.3 book - Part 10 of 184
The Ring programming language version 1.5.3 book - Part 10 of 184The Ring programming language version 1.5.3 book - Part 10 of 184
The Ring programming language version 1.5.3 book - Part 10 of 184Mahmoud Samir Fayed
 

More Related Content

What's hot

딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)Hansol Kang
 
Use C++ to Manipulate mozSettings in Gecko
Use C++ to Manipulate mozSettings in GeckoUse C++ to Manipulate mozSettings in Gecko
Use C++ to Manipulate mozSettings in GeckoChih-Hsuan Kuo
 
The Ring programming language version 1.5.4 book - Part 10 of 185
The Ring programming language version 1.5.4 book - Part 10 of 185The Ring programming language version 1.5.4 book - Part 10 of 185
The Ring programming language version 1.5.4 book - Part 10 of 185Mahmoud Samir Fayed
 
Dealing with combinatorial explosions and boring tests
Dealing with combinatorial explosions and boring testsDealing with combinatorial explosions and boring tests
Dealing with combinatorial explosions and boring testsAlexander Tarlinder
 
Java practice programs for beginners
Java practice programs for beginnersJava practice programs for beginners
Java practice programs for beginnersishan0019
 
Nodejs性能分析优化和分布式设计探讨
Nodejs性能分析优化和分布式设计探讨Nodejs性能分析优化和分布式设计探讨
Nodejs性能分析优化和分布式设计探讨flyinweb
 
Java Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersJava Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersSergey Kuksenko
 
Pandas+postgre sql 實作 with code
Pandas+postgre sql 實作 with codePandas+postgre sql 實作 with code
Pandas+postgre sql 實作 with codeTim Hong
 
The Language for future-julia
The Language for future-juliaThe Language for future-julia
The Language for future-julia岳華 杜
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Comsysto Reply GmbH
 
Testing a 2D Platformer with Spock
Testing a 2D Platformer with SpockTesting a 2D Platformer with Spock
Testing a 2D Platformer with SpockAlexander Tarlinder
 
JPoint 2016 - Валеев Тагир - Странности Stream API
JPoint 2016 - Валеев Тагир - Странности Stream APIJPoint 2016 - Валеев Тагир - Странности Stream API
JPoint 2016 - Валеев Тагир - Странности Stream APItvaleev
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Flink Forward
 
Where the wild things are - Benchmarking and Micro-Optimisations
Where the wild things are - Benchmarking and Micro-OptimisationsWhere the wild things are - Benchmarking and Micro-Optimisations
Where the wild things are - Benchmarking and Micro-OptimisationsMatt Warren
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Matt Warren
 

What's hot (20)

딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
딥러닝 중급 - AlexNet과 VggNet (Basic of DCNN : AlexNet and VggNet)
 
Use C++ to Manipulate mozSettings in Gecko
Use C++ to Manipulate mozSettings in GeckoUse C++ to Manipulate mozSettings in Gecko
Use C++ to Manipulate mozSettings in Gecko
 
The Ring programming language version 1.5.4 book - Part 10 of 185
The Ring programming language version 1.5.4 book - Part 10 of 185The Ring programming language version 1.5.4 book - Part 10 of 185
The Ring programming language version 1.5.4 book - Part 10 of 185
 
Dealing with combinatorial explosions and boring tests
Dealing with combinatorial explosions and boring testsDealing with combinatorial explosions and boring tests
Dealing with combinatorial explosions and boring tests
 
Java practice programs for beginners
Java practice programs for beginnersJava practice programs for beginners
Java practice programs for beginners
 
Nodejs性能分析优化和分布式设计探讨
Nodejs性能分析优化和分布式设计探讨Nodejs性能分析优化和分布式设计探讨
Nodejs性能分析优化和分布式设计探讨
 
Java Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersJava Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware counters
 
Pandas+postgre sql 實作 with code
Pandas+postgre sql 實作 with codePandas+postgre sql 實作 with code
Pandas+postgre sql 實作 with code
 
The Language for future-julia
The Language for future-juliaThe Language for future-julia
The Language for future-julia
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Testability for Developers
Testability for DevelopersTestability for Developers
Testability for Developers
 
Welcome to python
Welcome to pythonWelcome to python
Welcome to python
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
 
Testing a 2D Platformer with Spock
Testing a 2D Platformer with SpockTesting a 2D Platformer with Spock
Testing a 2D Platformer with Spock
 
JPoint 2016 - Валеев Тагир - Странности Stream API
JPoint 2016 - Валеев Тагир - Странности Stream APIJPoint 2016 - Валеев Тагир - Странности Stream API
JPoint 2016 - Валеев Тагир - Странности Stream API
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced
 
Where the wild things are - Benchmarking and Micro-Optimisations
Where the wild things are - Benchmarking and Micro-OptimisationsWhere the wild things are - Benchmarking and Micro-Optimisations
Where the wild things are - Benchmarking and Micro-Optimisations
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016
 

Similar to Machine Learning Model Bakeoff

Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]AAKANKSHA JAIN
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptxArthur240715
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Codemotion Tel Aviv
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsOmkar Rane
 
Flock: Data Science Platform @ CISL
Flock: Data Science Platform @ CISLFlock: Data Science Platform @ CISL
Flock: Data Science Platform @ CISLDatabricks
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleYvonne K. Matos
 
Model-Based Design & Analysis.ppt
Model-Based Design & Analysis.pptModel-Based Design & Analysis.ppt
Model-Based Design & Analysis.pptRajuRaju183149
 
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?TechWell
 
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, InriaParis Open Source Summit
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseFlorian Wilhelm
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseinovex GmbH
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyDatabricks
 
From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014Gloria Lovera
 
Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018David Tan
 
Generatingcharacterizationtestsforlegacycode
GeneratingcharacterizationtestsforlegacycodeGeneratingcharacterizationtestsforlegacycode
GeneratingcharacterizationtestsforlegacycodeCarl Schrammel
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...GlobalLogic Ukraine
 

Similar to Machine Learning Model Bakeoff (20)

Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissions
 
Flock: Data Science Platform @ CISL
Flock: Data Science Platform @ CISLFlock: Data Science Platform @ CISL
Flock: Data Science Platform @ CISL
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and Kaggle
 
Deep learning in manufacturing predicting and preventing manufacturing defect...
Deep learning in manufacturing predicting and preventing manufacturing defect...Deep learning in manufacturing predicting and preventing manufacturing defect...
Deep learning in manufacturing predicting and preventing manufacturing defect...
 
Model-Based Design & Analysis.ppt
Model-Based Design & Analysis.pptModel-Based Design & Analysis.ppt
Model-Based Design & Analysis.ppt
 
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?
Rise of the Machines: Can Artificial Intelligence Terminate Manual Testing?
 
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria
#OSSPARIS19: Introduction to scikit-learn - Olivier Grisel, Inria
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise Company
 
From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014
 
Py conie 2014
Py conie 2014Py conie 2014
Py conie 2014
 
Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018
 
Generatingcharacterizationtestsforlegacycode
GeneratingcharacterizationtestsforlegacycodeGeneratingcharacterizationtestsforlegacycode
Generatingcharacterizationtestsforlegacycode
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...
GlobalLogic Test Automation Online TechTalk “Test Driven Development as a Per...
 

Recently uploaded

Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...
Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...
Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...2toLead Limited
 
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...ShapeBlue
 
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)François
 
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Chris Bingham
 
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUG
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUGBoosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUG
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUGRick Ossendrijver
 
AI-Plugins-Planners-Persona-SemanticKernel.pptx
AI-Plugins-Planners-Persona-SemanticKernel.pptxAI-Plugins-Planners-Persona-SemanticKernel.pptx
AI-Plugins-Planners-Persona-SemanticKernel.pptxUdaiappa Ramachandran
 
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlueCloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlueShapeBlue
 
AMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes WebinarAMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes WebinarThousandEyes
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingerssuser9354ce
 
Python For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emPython For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emNho Vĩnh
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerSaiLinnThu2
 
Why Disability Justice should be at the core of your digital accessibility jo...
Why Disability Justice should be at the core of your digital accessibility jo...Why Disability Justice should be at the core of your digital accessibility jo...
Why Disability Justice should be at the core of your digital accessibility jo...Modality Co
 
Trading Software Development_ Trends to Watch in 2024.pdf
Trading Software Development_ Trends to Watch in 2024.pdfTrading Software Development_ Trends to Watch in 2024.pdf
Trading Software Development_ Trends to Watch in 2024.pdfLucas Lagone
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIVijayananda Mohire
 
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Cprime
 
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...2toLead Limited
 
Mastering Play Store App Listing and Optimization
Mastering Play Store App Listing and OptimizationMastering Play Store App Listing and Optimization
Mastering Play Store App Listing and OptimizationAppsthentic Technology
 
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Jay Zhao
 
Roundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfRoundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfMostafa Higazy
 

Recently uploaded (20)

Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...
Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...
Microsoft x 2toLead Webinar Session 2 - How Employee Learning and Development...
 
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
 
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
 
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
 
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUG
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUGBoosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUG
Boosting Developer Effectiveness with a Java platform team 1.4 - ArnhemJUG
 
AI-Plugins-Planners-Persona-SemanticKernel.pptx
AI-Plugins-Planners-Persona-SemanticKernel.pptxAI-Plugins-Planners-Persona-SemanticKernel.pptx
AI-Plugins-Planners-Persona-SemanticKernel.pptx
 
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlueCloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
 
AMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes WebinarAMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes Webinar
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostinger
 
Python For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emPython For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ em
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
 
Why Disability Justice should be at the core of your digital accessibility jo...
Why Disability Justice should be at the core of your digital accessibility jo...Why Disability Justice should be at the core of your digital accessibility jo...
Why Disability Justice should be at the core of your digital accessibility jo...
 
Trading Software Development_ Trends to Watch in 2024.pdf
Trading Software Development_ Trends to Watch in 2024.pdfTrading Software Development_ Trends to Watch in 2024.pdf
Trading Software Development_ Trends to Watch in 2024.pdf
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
 
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
 
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
 
Mastering Play Store App Listing and Optimization
Mastering Play Store App Listing and OptimizationMastering Play Store App Listing and Optimization
Mastering Play Store App Listing and Optimization
 
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
 
Roundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfRoundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdf
 

Machine Learning Model Bakeoff

  • 1. MODEL BAKEOFF WHICH MODEL CAME HOT AND FRESH OUT THE KITCHEN IN OUR MALWARE CLASSIFIER BAKEOFF? Data Intelligence Conference 2017 Phil Roth
  • 3. 3 whoami Hyrum Anderson @drhyrum Jonathan Woodbridge @jswoodbridge Bobby Filar @filar Phil Roth Data Scientist @mrphilroth proth@endgame.com
  • 4. Data Science in Security 4 Domain Generation Algorithm (DGA) Protection Malicious File Classification Anomaly Detection Insider Threat Detection Network Intrusion Detection System doable hardly doable etc….
  • 5. Mission 5 The best minds of my generation are thinking about how to make people click ads. That sucks. - Jeff Hammerbacher https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different
  • 7. Challenges 7 Lack of open datasets Labels are expensive to obtain High cost of false positives AND false negatives
  • 8. Malware Classification Antivirus. As a supervised ML problem. Windows Executables (portable executables or PEs) are sorted into two classes: benign and malicious 8
  • 9. MalwareScore Static features Deployed to customer machines Available at VirusTotal 9 https://www.virustotal.com/
  • 10. Why a Model Bakeoff? Decide on an approach 10
  • 11. Why a Model Bakeoff? Build institutional knowledge of diverse models 11
  • 12. Why NOT a Model Bakeoff? Incomplete exploration of model design space 12 Mike Bostock. Visualizing Algorithms. https://bost.ocks.org/mike/algorithms/
  • 13. Why NOT a Model Bakeoff? Setting up the bakeoff is most of the work 13 Highly Scientific and Definitely Not Made Up Data About How Hard Tasks Are
  • 14. Why NOT a Bakeoff? Optimizing the model is squeezing the last bits of performance. 14 Blue is so much better!!!!
  • 15. Setup
  • 16. Data 16 Top 10 Malicious Families MalwareScore is trained on 6M benign and 3M malicious files The bakeoff was carried out on an early subset of those
  • 17. Features 17 Byte Histogram Sliding Window Byte Entropy PE Information PE Imports PE Exports PE Sections PE Version Information Raw data based features PE Header based features
  • 18. Features 18 Byte Histogram Sliding Window Byte Entropy 0 3 0 0 0 4 0 0 0 255 255 0 0 184 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 216 0 0 77 90 144 14 31 186 14 0 180 9 205 33 184 1 76 205 33 84 104 105 115 32 112 114 111 103 114 97 109 32 99 97 110 110 111 116 32 98 101 32 117 110 32 105 110 32 68 79 83 32 109 111 100 101 46 13 13 10 36 0 0 0 0 0 0 0 49 184 132 58 117 217 234 105 117 217 234 105 11 217 234 105 182 214 181 105 119 217 234 105 117 217 235 105 238 217 234 105 182 214 183 105 100 217 234 105 33 250 218 105 127 217 234 105 178 223 236 105 116 217 234 105 82 105 99 104 117 21 234 105 0 2
  • 19. Features 19 PE Information PE Imports PE Exports PE Sections PE Version Information
  • 20. Features 20 PE Information PE Imports PE Exports PE Sections PE Version Information Feature Hashing is applied to each of these feature groups.
  • 21. Models 21 Model architectures were chosen by team member most knowledgeable Small grid search carried out to find the best parameters
  • 22. Metrics to Judge Models Performance Model size Query execution time 22
  • 23. Metrics Performance: ROC AUC (area under receiver operating characteristic curve) 23 http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
  • 24. Metrics Model Size: Gauge the feasibility of deploying a model to a user’s computer Query Time: Gauge the feasibility of evaluating PE files as they are written to disk or as users attempt to run them. 24
  • 27. Nearest Neighbor 27 from sklearn.neighbors import KNeighborsClassifier tuned_parameters = {'n_neighbors': [1,5,11], 'weights': ['uniform','distance']} model = bake(KNeighborsClassifier(algorithm='ball_tree'), tuned_parameters, X[ix], y[ix]) pickle.dump(model, open('Bakeoff_kNN.pkl', 'w')) Introduction to Statistical Learning page 40
  • 28. Logistic Regression 28 from sklearn.linear_model import SGDClassifier tuned_parameters = {'alpha':[1e-5,1e-4,1e-3], 'l1_ratio': [0., 0.15, 0.85, 1.0]} model = bake(SGDClassifier(loss='log', penalty='elasticnet'), tuned_parameters, X, y) pickle.dump(model, open('Bakeoff_logisticRegression.pkl', 'w')) Introduction to Statistical Learning page 131
  • 29. Support Vector Machine 29 Introduction to Statistical Learning page 342 from sklearn.linear_model import SGDClassifier tuned_parameters = {'alpha':[1e-5,1e-4,1e-3], 'l1_ratio': [0., 0.15, 0.85, 1.0]} model = bake(SGDClassifier(loss='hinge', penalty='elasticnet'), tuned_parameters, X, y) pickle.dump(model, open('Bakeoff_SVM.pkl', 'w'))
  • 30. Naïve Bayes 30 from sklearn.naive_bayes import GaussianNB tuned_parameters = {} model = bake(GaussianNB(), tuned_parameters, X, y) pickle.dump(model, open('Bakeoff_NB.pkl', 'w')) http://scikit-learn.org/stable/modules/naive_bayes.html
  • 31. Random Forest 31 from sklearn.ensemble import RandomForestClassifier tuned_parameters = {'n_estimators': [20, 50, 100], 'min_samples_split': [2, 5, 10], 'max_features': ["sqrt", .1, 0.2]} model = bake(RandomForestClassifier(oob_score=False), tuned_parameters, X, y pickle.dump(R, open('Bakeoff_randomforest.pkl', 'w')) Introduction to Statistical Learning page 308
  • 32. Gradient Boosted Decision Trees 32 from xgboost import XGBClassifier tuned_parameters = {'max_depth': [3, 4, 5], 'n_estimators': [20, 50, 100], 'colsample_bytree': [0.9, 1.0]} model = bake(XGBClassifier(), tuned_parameters, X, y) pickle.dump(R, open('Bakeoff_xgboost.pkl', 'w')) http://scikit-learn.org/stable/auto_examples/index.html
  • 33. Deep Learning 33 Features are fed to a multilayer perceptron with three hidden layers using dropout and batch normalization. http://deeplearning.net/tutorial/mlp.html
  • 34. 34 from keras.models import Sequential from keras.layers.core import Dense, Activation, Dropout from keras.layers import PReLU, BatchNormalization model = Sequential() model.add(Dropout(input_dropout, input_shape=(n_units,))) model.add( BatchNormalization() ) model.add(Dense(n_units, input_shape=(n_units,))) model.add(PReLU()) model.add(Dropout(hidden_dropout)) model.add( BatchNormalization() ) model.add(Dense(n_units)) model.add(PReLU()) model.add(Dropout(hidden_dropout)) model.add( BatchNormalization() ) model.add(Dense(n_units)) model.add(PReLU()) model.add(Dropout(hidden_dropout)) model.add( BatchNormalization() ) model.add(Dense(1, activation='sigmoid')) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  • 42. Takeaways We set up a Kaggle competition. xgboost tends to win these competitions. 42 http://hdl.handle.net/11250/2433761 Available at: Proposed answer: Newton boosting instead of MART (multiple additive regression trees)
  • 43. But deep learning… We haven’t yet traveled far enough down the deep learning path. This does not yet exist for PE files: 43 GoogLeNet Architecture from Inception paper. Available at: https://arxiv.org/abs/1409.4842
  • 44. But deep learning… 44 The most useful features are transparently available. Decision trees set a very high bar for deep learning to clear.
  • 45. But deep learning… 45 At Endgame, we do believe there is discriminating power in the section data itself. We are researching the best deep learning architectures and the best way to combine the results with PE header features.
  • 46. Our Conclusions 46 Let’s run gbdt on the endpoint! Let’s provide a larger model in the cloud! Deep learning deserves more research!
  • 47. Further Reading 47 Examining Malware with Python https://www.endgame.com/blog/technical-blog/examining-malware-python Machine Learning https://www.endgame.com/blog/technical-blog/machine-learning-you-gotta-tame-beast-you-let-it-out-its-cage It’s a Bake-off! https://www.endgame.com/blog/technical-blog/its-bake-navigating-evolving-world-machine-learning-models