Motivation Methodology Evaluation Conclusion
Machine Learning for Malware Detection:
Beyond Accuracy Rates
Lucas Galante, Marcus Botacin, André Grégio, Paulo Lício de Geus
SBSEG 2019
Agenda
1 Motivation
Motivation
2 Methodology
Malware Classifier
3 Evaluation
Beyond Accuracy Rates
4 Conclusion
Conclusion
Motivation
Malware Increase
Figure: Increase of 46% in malware activity in Q2 of 2019.
https://tinyurl.com/y6qzn83h
Malware Classification
Figure: Necessity of antimalware applications.
https://tinyurl.com/y2bz5k58
Malware Classifier
Extracted Features
Table: Malware features classified according to extraction method (static
or dynamic) and representation (discrete or continuous).

Static (discrete)    Static (continuous)    Dynamic (both)
Embedded files       Size sections          fork syscall
Disassembly fail     # headers              /proc access
/home string         /home string           ptrace syscall
ptrace syscall       # .dynamic             /home access
/sys string          /sys string            socket syscall
Network strings      # sections             passwd access
Linkage              passwd string          mmap syscall
Header present       # symbols              permission denied
UPX                  # libs                 SIGTERM
passwd string        # relocations          SIGSEGV
fork syscall         Size sample
compiler string      # debug section
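The distinction between discrete and continuous representations can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes that "discrete" means a presence flag and "continuous" means an occurrence count (which the "#" prefix in the table suggests). The event names in `DYNAMIC_FEATURES`, the function `encode_trace`, and the sample trace are all hypothetical.

```python
from collections import Counter

# Hypothetical event names, loosely following the dynamic features in the table.
DYNAMIC_FEATURES = ["fork", "ptrace", "socket", "mmap", "SIGTERM", "SIGSEGV"]


def encode_trace(events):
    """Return (discrete, continuous) feature vectors for one execution trace.

    discrete:   1 if the event was observed at least once, else 0 (assumed meaning)
    continuous: number of times the event was observed (assumed meaning)
    """
    counts = Counter(events)
    continuous = [counts[name] for name in DYNAMIC_FEATURES]
    discrete = [1 if c > 0 else 0 for c in continuous]
    return discrete, continuous


# Made-up trace, for illustration only.
trace = ["mmap", "mmap", "fork", "mmap", "socket", "SIGSEGV"]
discrete, continuous = encode_trace(trace)
print(discrete)    # [1, 0, 1, 1, 0, 1]
print(continuous)  # [1, 0, 1, 3, 0, 1]
```

Under this reading, both vectors come from the same trace, so the two representations differ only in how much of the count information reaches the classifier.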
Classification Overview
Figure: Overview of classification process.
Beyond Accuracy Rates
Importance of a Good Feature Extraction Procedure
SVM classification of static continuous features.

Kernel / Iterations (#)    1000      10000     100000
Poly                       49.32%    49.74%    49.95%
Linear                     73.87%    77.64%    80.94%
rbf                        84.92%    84.92%    84.92%

SVM classification of dynamic continuous features.

Kernel / Iterations (#)    1000      10000     100000
Poly                       49.92%    49.76%    50.71%
Linear                     93.73%    86.51%    86.73%
rbf                        92.63%    92.63%    92.63%
Importance of Evaluated Datasets
Mixed dataset. Random Forest classification of static continuous features.

Max Depth / Estimators (#)    16        32        64
8                             99.17%    99.06%    99.20%
16                            99.13%    99.06%    99.09%
32                            99.09%    99.13%    99.17%

VirusTotal dataset. Random Forest classification of static continuous features.

Max Depth / Estimators (#)    16        32        64
8                             94.29%    94.35%    94.24%
16                            94.24%    94.14%    94.08%
32                            94.08%    94.14%    94.19%
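To make the comparison between the two tables above concrete, the short calculation below contrasts the accuracy gap between datasets with the spread caused by hyperparameters. The numbers are copied from the slide; the variable names (`mixed`, `virustotal`, `gap`, `hyper_range`) are illustrative, and the choice of mean and range as summary statistics is an assumption rather than something the slides specify.

```python
from statistics import mean

# Accuracies (%) copied from the two Random Forest tables above.
mixed = [99.17, 99.06, 99.20, 99.13, 99.06, 99.09, 99.09, 99.13, 99.17]
virustotal = [94.29, 94.35, 94.24, 94.24, 94.14, 94.08, 94.08, 94.14, 94.19]

# Difference between the average accuracy on each dataset.
gap = mean(mixed) - mean(virustotal)

# Largest max-min spread observed within either table, i.e. the effect of
# changing max depth and number of estimators on a fixed dataset.
hyper_range = max(max(mixed) - min(mixed), max(virustotal) - min(virustotal))

print(f"mean mixed:      {mean(mixed):.2f}%")
print(f"mean VirusTotal: {mean(virustotal):.2f}%")
print(f"dataset gap:     {gap:.2f} points")
print(f"largest spread across hyperparameters: {hyper_range:.2f} points")
```

On these figures, switching datasets moves accuracy by roughly five points, while the hyperparameter grid moves it by well under half a point.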
Analyst Importance
SVM classification of dynamic continuous features.

Kernel / Iterations (#)    1000      10000     100000
Poly                       50.91%    54.05%    58.16%
Linear                     97.97%    97.56%    80.35%
rbf                        98.54%    98.54%    98.54%

SVM classification of dynamic discrete features.

Kernel / Iterations (#)    1000      10000     100000
Poly                       79.68%    79.91%    79.91%
Linear                     96.48%    96.48%    96.48%
rbf                        96.35%    96.35%    96.35%
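One way to read the two tables above is to measure how much each kernel's accuracy moves as the iteration budget changes. The sketch below does this with the population standard deviation; the numbers are copied from the slide, while the helper `spread` and the choice of standard deviation as the variability measure are assumptions made for illustration.

```python
from statistics import pstdev

# Accuracies (%) at 1000, 10000 and 100000 iterations, copied from the tables above.
continuous = {
    "Poly": [50.91, 54.05, 58.16],
    "Linear": [97.97, 97.56, 80.35],
    "rbf": [98.54, 98.54, 98.54],
}
discrete = {
    "Poly": [79.68, 79.91, 79.91],
    "Linear": [96.48, 96.48, 96.48],
    "rbf": [96.35, 96.35, 96.35],
}


def spread(table):
    """Population standard deviation of accuracy per kernel (hypothetical helper)."""
    return {kernel: pstdev(values) for kernel, values in table.items()}


cont_spread = spread(continuous)
disc_spread = spread(discrete)
for kernel in continuous:
    print(f"{kernel:6s} continuous={cont_spread[kernel]:.2f} "
          f"discrete={disc_spread[kernel]:.2f}")
```

For every kernel in these tables, the discrete representation varies no more than the continuous one, with the linear kernel showing the largest difference.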
What ML results teach us
Static feature importance

Discrete           Importance    Continuous          Importance
Network strings    40%           Binary size         27%
UPX present        17%           # headers           16.70%
passwd strings     1.40%         # debug sections    0.20%

Dynamic feature importance

Discrete           Importance    Continuous          Importance
mmap               50%           # mmap              68%
fork               6%            # fork              10.80%
SIGSEGV            10.60%        # SIGSEGV           1.30%
Conclusion
Conclusion
Our results show that:
Dynamic features outperform static features
Discrete features present smaller accuracy variance
Each dataset's distinct characteristics pose challenges to ML models
Feature analysis can be used as feedback information
Acknowledgement
This work is supported by:
Brazilian National Council for Scientific and Technological Development (CNPq)
CESeg assistance
Questions, Criticism, and Suggestions.
Contact
galante@lasca.ic.unicamp.br
Complete version
https://github.com/marcusbotacin/ELF.Classifier
Previous work
https://github.com/marcusbotacin/Linux.Malware
Reverse Engineering Workshop
Thursday @ 13:30