Machine and Deep Learning Application.
Applying big data learning techniques to a malware classification problem.
Code:
https://gist.github.com/indraneeld/7ffb182fd8eb87d6d463dedc001efad0
Acknowledgments:
A Canadian Institute for Cybersecurity (CIC) project in collaboration with the Canadian Centre for Cyber Security (CCCS).
1. Android Malware
Machine and Deep Learning Application
Canadian Institute for Cybersecurity (CIC) project in collaboration with the Canadian Centre for Cyber Security (CCCS)
Indraneel C. Dabhade
2021
2. Topics.
Use Case Analysis
Data Exploration
Project Framework
Feature Exploration
Machine Learning Technique
Deep Learning Technique
Introduction
Assumptions and Future Work
Acknowledgments
Appendix
3. Introduction
Questions answered.
Why have I chosen a specific method for data quality assessment?
Why have I chosen a specific method for feature engineering?
Why have I chosen a specific algorithm?
Why have I chosen a specific framework?
Why have I chosen a specific model performance indicator?
4. Introduction
Project Snapshot.
Machine Learning Technique: Random Forest.
Accuracy: ~50.0%
Deep Learning with optimization using Stochastic Gradient Descent.
Accuracy: ~92.0%
Technology Infrastructure:
IBM Cloud Pak
Python 3.7 with Apache Spark 3
TensorFlow 2.0
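The snapshot names Stochastic Gradient Descent as the optimizer and accuracy as the reported indicator. As a minimal sketch of what those two terms mean, the block below fits a single logistic unit with per-sample SGD updates and then computes accuracy. It is not the project's TensorFlow model: the toy data, the learning rate, the epoch count, and the function names `sgd_logistic` and `accuracy` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(samples, labels, lr=0.1, epochs=200, seed=0):
    """Fit weights and bias one sample at a time (stochastic gradient descent)."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            pred = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = pred - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def accuracy(samples, labels, weights, bias):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = 0
    for x, y in zip(samples, labels):
        pred = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5
        hits += int(pred == bool(y))
    return hits / len(samples)


# Made-up, linearly separable toy data (not the project's dataset).
samples = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]

weights, bias = sgd_logistic(samples, labels)
score = accuracy(samples, labels, weights, bias)
```

A deep network replaces the single logistic unit with stacked layers, but the update rule (step against the gradient of the loss for one sample or a small batch) stays the same.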
9. Deep Learning
Feature Engineering
Filter Features: Variance Threshold of 1000.
Feature Pipeline Execution: Standard Scaler, Principal Component Analysis.
Label Pipeline Execution: MinMax Scaler, One Hot Encoding.
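Three of the feature-engineering steps named on this slide (the variance threshold of 1000, standard scaling, and one-hot encoding of labels) can be sketched with the Python standard library alone. This is an illustration under assumptions, not the project's Spark or TensorFlow pipeline: the toy rows, the label names, and the helper names `filter_by_variance`, `standard_scale`, and `one_hot` are hypothetical. The Principal Component Analysis and MinMax Scaler steps are not shown.

```python
from statistics import mean, pstdev, pvariance


def filter_by_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep


def standard_scale(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def one_hot(labels):
    """Map each label to a 0/1 vector, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    encoded = [[1 if index[y] == j else 0 for j in range(len(classes))] for y in labels]
    return encoded, classes


# Made-up toy data: three samples, three features (not the project's dataset).
rows = [
    [0.0, 10.0, 5.0],
    [100.0, 10.5, 5.0],
    [200.0, 9.5, 5.0],
]
labels = ["Adware", "Benign", "Adware"]  # hypothetical class names

filtered, kept = filter_by_variance(rows, threshold=1000)
scaled = standard_scale(filtered)
encoded, classes = one_hot(labels)
```

With this toy input only the first column survives the filter, because the other two columns vary far less than the threshold of 1000. A threshold that high only makes sense when applied before scaling, since standardized columns all have unit variance.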
15. Assumptions and Future Work
Could not run Keras2DML.
The sample is assumed to be an unbiased representation of the population.
Further exploration of features using the bias-variance tradeoff.
The current project focuses on the ‘Dynamic Analysis’ dataset; the ‘Static Analysis’ datasets can be explored next.
Acknowledgments
I would like to thank my family and friends for their support, the Canadian Institute for Cybersecurity (CIC)
and the Canadian Centre for Cyber Security (CCCS), and IBM Cloud Pak for the infrastructure.
All models are wrong, but some are useful – George Box.
16. Appendix
Python code for the Machine Learning execution of this project. (MachineLearning.ipynb)
Python code for the Deep Learning execution of this project. (DeepLearning.ipynb)
Data used for the project. (df.csv)
Data for statistical analysis. (Sample.csv)
Python code for Statistical Analysis. (DataStatistics.ipynb)
https://gist.github.com/indraneeld/7ffb182fd8eb87d6d463dedc001efad0