Machine Learning Techniques for Enhanced Malware Detection in Portable
Executable Files
Chapter · May 2025
DOI: 10.1007/978-3-031-88653-9_74
3 authors:
Walid El Mouhtadi
Université Sultan Moulay Slimane
Maleh Yassine
Université Sultan Moulay Slimane
Soufyane Mounir
National School of Applied Sciences, Khouribga, Morocco
Machine Learning Techniquesfor Enhanced Malware Detection 785
• How can adjusting hyperparameters improve ML algorithms’ ability to detect
malware?
This research delves into the application of machine learning for malware clas-
sification, focusing specifically on Portable Executable (PE) files. With a dataset of
96,765 malware samples and 41,323 legitimate samples, we address imbalances and
complexities by employing feature selection techniques. Five machine learning algo-
rithms, including random forests and gradient boosting, were trained and tested, with
hyperparameter tuning significantly enhancing accuracy.
The primary contributions of this research include a thorough investigation into
machine learning for malware classification, particularly targeting PE files, and the
utilization of feature selection techniques to enhance model performance. The study
evaluates algorithms based on accuracy, precision, and performance on unseen files,
emphasizing the importance of effective pre-processing techniques.
The study aims to deepen our understanding of machine learning’s potential and lim-
itations in malware detection, addressing the effectiveness of data balancing, the impact
of pre-processing on portable executable files, and hyperparameter tuning. The paper
outlines an effective malware detection system, highlighting the significance of data
pre-processing, feature selection, and model selection. It introduces a novel perspective
by identifying optimal initial hyperparameter values, serving as a reference for future
research.
The paper is organized as follows. Section 2 reviews related work in the field. Section 3
details the data and methods employed, with subsections on data collection and model
optimization and on data balancing, splitting, and feature selection. Section 4 presents
the results of the analysis and discusses them. Section 5 discusses the implications and
significance of the results. Finally, Sect. 6 concludes the paper, summarizing the key
findings and suggesting avenues for future research.
2 Related Works
Malware detection is a challenging task in cybersecurity [5–7], and various approaches
have been proposed to tackle this issue [8, 9]. Researchers have recently explored dif-
ferent methods, such as machine learning algorithms [10], signature-based protection,
heuristic-based methods and behavior-based methods [11]. However, previous surveys
have limitations regarding coverage of new research trends and the taxonomy of feature
types used for malware detection [12].
Kapratwar et al. [10] integrate static and dynamic analysis to understand Android
malware behavior comprehensively. Static analysis involves extracting permissions from
the manifest file, providing valuable insights into the app’s authorized actions. Dynamic
analysis, on the other hand, captures system calls during runtime, enabling a deeper
understanding of the code’s execution paths.
786 W. ElMouhtadi et al.
Karampudi et al. [14] highlight the limitations of static techniques, such as the decision
tree algorithm, in detecting malware. To overcome these constraints, they propose a hybrid
approach that integrates multiple machine learning (ML) algorithms using ensemble
learning techniques, with an AdaBoost classifier implemented to enhance the efficiency
of the hybrid model. The study evaluates the performance of various ML algorithms,
including Decision Tree, Gaussian Naive Bayes, Random Forest, and Linear SVM, for
analysing malware content. Algorithms with lower accuracy are boosted using AdaBoost,
resulting in significant accuracy improvements. The study concludes that the hybrid
approach, combined with ensemble learning, yields more effective and accurate results
in malware detection.
Li et al. [11] address the rapid growth of malicious apps in the mobile ecosystem,
particularly on the Android platform. The authors introduce SigPID, a malware detection
system based on permission usage analysis. The evaluation shows that, using only 22
significant permissions, SigPID achieves high precision, recall, accuracy, and F-measure,
similar to a baseline approach that analyzes all permissions. Moreover, SigPID is more
effective than other state-of-the-art approaches, detecting a high percentage of the
malware in the dataset, including unknown/new malware samples.
Singh et al. [12] explore a support-vector-machine (SVM) [17] based machine learning
(ML) model for malware detection in systems. The primary objective of their research
was to improve the SVM-ML model’s performance through data pre-processing. To achieve
this, the researchers employed four distinct data pre-processing techniques [18]. These
techniques were applied to a well-established dataset, the Portable Executable Header
(PEH) [19, 20] classification of malware (CLaMP) dataset.
3 Data and Methods
3.1 Data Collection and Model Optimization
In this research, a robust and efficient data processing and model optimization workflow
has been implemented to address this project’s specific challenges and requirements.
The approach incorporates advanced techniques tailored to tackle the issues associated
with unbalanced data and a large, diverse feature set, which can often result in complex
models prone to over-fitting.
The following schema (Fig. 1) outlines the comprehensive Data Processing and
Model Optimization Workflow developed for efficient processing and model enhance-
ment:
Fig. 1. Data Processing and Model Optimization Workflow
Accurate results depend on a high-quality dataset [15]. This study compiled a dataset of
138,042 instances with 57 features, of which roughly 96,000 instances represent malicious
files and roughly 41,000 represent legitimate files. The dataset was obtained from the
Kaggle platform and organized in an Excel file. It covers a range of malware types
collected from 2017 to 2023, including worms, Trojan horses, ransomware, spyware, adware,
keyloggers, rootkits, botnets, fileless malware, and macro viruses.
During dataset refinement, 54 portable executable characteristics were examined. The
ExtraTreesClassifier algorithm identified 14 key features, which were retained to guard
against underfitting, overfitting, and bias in the subsequent analyses.
To assess how the models perform on unseen files, additional legitimate and malicious
files were collected from VirusShare, MalwareBazaar, and VirusTotal.
Addressing class imbalance, a Data Rectification technique adjusted class instances
for a representative distribution [21, 22]. Feature selection reduced dataset dimension-
ality for more efficient machine learning algorithms.
Hyperparameter tuning optimized model performance by systematically adjusting
algorithm-specific parameters. Trained models underwent evaluation using performance
metrics like accuracy, precision, recall, and F1-score.
Saved as files for practical use, the models allow easy deployment in real-world
scenarios. Evaluating model performance on a separate set of PE files provided insights
into their effectiveness and generalization capability.
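The text does not state which file format was used to save the models. As a minimal sketch, assuming a scikit-learn estimator and Python's standard pickle module (the synthetic data, the choice of DecisionTreeClassifier, and the file name are all illustrative assumptions), saving and reloading a trained model could look like this:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 14 selected PE features (not the paper's data).
X, y = make_classification(n_samples=200, n_features=14, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "decision_tree.pkl"  # hypothetical file name
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    # Later, for example inside a scanning tool: reload the saved model.
    with model_path.open("rb") as handle:
        restored = pickle.load(handle)

# The restored model should classify the same inputs identically.
same_predictions = (restored.predict(X) == model.predict(X)).all()
print("restored model reproduces predictions:", bool(same_predictions))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.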
Through systematic implementation and evaluation, this research aimed to enhance
malware detection accuracy and efficiency. The presented findings and methodologies
contribute to advancing malware analysis, offering valuable insights for cybersecurity
applications and future research. See Fig. 2 for an illustration of the advanced architecture
designed for accurate file classification.
Fig. 2. Advanced Architecture for Accurate Classification of Unseen Files
3.2 Data Balancing, Splitting and Feature Selection
This step ensures that the data used for training the model is balanced, meaning each
class has an equal number of samples. This is important because imbalanced data can
bias the model toward the majority class.
The graphs in Fig. 3 and Fig. 4 show the percentage of samples in each class before and
after balancing the data:
Fig. 3. Before Balancing
Fig. 4. After Balancing
• Before balancing: 138,088 samples (96,765 Malicious and 41,323 Legitimate samples)
• After balancing: 82,646 samples in total (41,323 Malicious and 41,323 Legitimate
samples).
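The counts above are consistent with randomly undersampling the majority class down to the size of the minority class, although the text does not name the exact procedure. A minimal sketch under that assumption, using only the Python standard library (the function name and toy labels are illustrative):

```python
import random
from collections import Counter


def undersample(rows, labels, seed=0):
    """Randomly drop samples so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with roughly the same 70/30 skew as the unbalanced dataset (hypothetical values).
toy_rows = [[i] for i in range(10)]
toy_labels = ["malicious"] * 7 + ["legitimate"] * 3
balanced_rows, balanced_labels = undersample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Undersampling discards majority-class samples; oversampling the minority class is a common alternative, but it would not produce the class sizes reported here.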
– Data Splitting:
Techniques like ExtraTreesClassifier utilize feature importance derived from deci-
sion trees to rank features based on their contribution to the model’s predictive
accuracy:
Let X be the input feature matrix of shape (n_samples, n_features) and y be the target
vector of shape (n_samples,), where:
– n_samples is the number of samples in the dataset.
– n_features is the number of features in each sample.
The ExtraTreesClassifier fits a set of decision trees T1, T2, … , Tn on different
random subsamples of X and y, and then aggregates their predictions through averaging.
– Feature Importance:
The importance of each feature is determined by the extratrees.feature_importances_
attribute, which provides the feature importance scores calculated by the
ExtraTreesClassifier.
– Printing Features:
The loop prints out the top features identified by the ExtraTreesClassifier along with
their corresponding importance.
The algorithm helped identify the most important features for accurate classification
by analyzing many potential features. As a result, 14 features were pinpointed as being
the most useful for determining whether a file was malicious. By using these 14 fea-
tures, the classification task achieved high accuracy. This highlights the effectiveness of
ExtraTreesClassifier in selecting the required features for classifying files.
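The selection procedure described above can be sketched with scikit-learn as follows. This is an illustration, not the authors' code: the synthetic data, the number of trees, and all variable names other than extratrees and feature_importances_ are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the PE feature matrix (the real dataset is not reproduced here).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=6, random_state=0
)

extratrees = ExtraTreesClassifier(n_estimators=100, random_state=0)
extratrees.fit(X, y)

# Rank features by importance, highest first, and keep the top k.
top_k = 14
importances = extratrees.feature_importances_
ranking = np.argsort(importances)[::-1][:top_k]

for rank, feature_index in enumerate(ranking.tolist(), start=1):
    print(f"{rank:2d}. feature {feature_index:2d}  importance={importances[feature_index]:.4f}")

X_selected = X[:, ranking]  # reduced matrix passed on to the classifiers
```

The importance scores are normalized to sum to 1, so each score can be read as a share of the total importance.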
The visualization below (Fig. 5) displays the 14 most important features identified
by the ExtraTreesClassifier for accurate classification:
4 Results and Discussion
4.1 Performance Evaluation
The experiments were conducted on a computer with an Intel Core i7 CPU, 16 GB RAM,
and an Intel® UHD GPU with 4.1 GB of dedicated video memory, running Windows 10 and
Python 3.8.5 with various libraries. The data were stored locally and used to train and
evaluate five machine learning algorithms: Random Forest, Decision Tree, Adaboost,
Gradient Boosting, and Gaussian Naive Bayes (GNB). The comparison criteria are True
Positive Rate (TPR), False Positive Rate (FPR), Precision, and Accuracy.
Fig. 5. Importance of the relevant feature
Before fine-tuning, the models were first evaluated with their initial, out-of-the-box
configurations. The confusion matrices obtained at this stage serve as the benchmark,
providing a baseline for the subsequent enhancements.
Table 2 below lists the values of the performance criteria for every algorithm that
was tested:
Table 2. Criteria for Initial Models
Algorithm         Accuracy    Precision  TPR    FPR
DecisionTree      0.99062311  0.988      0.992  0.012
RandomForest      0.99165154  0.992      0.995  0.008
Adaboost          0.98324258  0.984      0.982  0.016
GradientBoosting  0.98505747  0.987      0.987  0.013
GNB               0.40505747  0.477      0.478  0.011
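The criteria in Table 2 can all be derived from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; treating the malicious class as the positive class is an assumption, and the counts are hypothetical rather than taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the reported criteria from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),  # true positive rate (recall)
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (tp + fn),  # false negative rate, equal to 1 - TPR
    }


# Hypothetical counts for a balanced test set of 1,000 files per class.
metrics = classification_metrics(tp=992, fp=12, tn=988, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the false negative rate is the complement of the TPR, reporting both the TPR and the FPR covers the two kinds of error: missed malware and false alarms.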
4.2 Optimized Hyperparameters for Various Models
The hyperparameters chosen for each model were carefully selected to strike a balance
between bias and variance, aiming to prevent overfitting while capturing the complexity
of the data. These choices were made through a thorough analysis of data characteristics,
iterative testing, and a deep understanding of each hyperparameter.
For example, specific values were assigned to hyperparameters such as max_features
and n_estimators, with both set to 15 to encompass all 14 essential features. Lower values
for min_samples_split (2, 5, 10) were chosen after extensive testing to enhance accuracy.
The selection of max_depth values (5, 10, 15) was influenced by the feature selection
process, prioritizing precision and simplicity with the 14 chosen features.
Note that relying on only the 2 top-ranked features would capture just 34% of the total
feature importance, as depicted in Fig. 5. Incorporating at least 5 features bases the
decisions on at least 48% of the total importance and feature variance, giving a more
comprehensive view of the dataset.
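Percentages of this kind are cumulative sums over the importance scores sorted from highest to lowest. A small sketch of that calculation follows; the scores are hypothetical and are not the values plotted in Fig. 5, and both function names are illustrative.

```python
from itertools import accumulate


def cumulative_importance(importances):
    """Return the running share of total importance, most important feature first."""
    ordered = sorted(importances, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def features_needed(importances, threshold):
    """Smallest number of top-ranked features whose combined share reaches the threshold."""
    for count, share in enumerate(cumulative_importance(importances), start=1):
        if share >= threshold:
            return count
    return len(importances)


# Hypothetical importance scores (illustration only).
scores = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]
shares = cumulative_importance(scores)
print(f"top 2 features: {shares[1]:.0%}, top 5 features: {shares[4]:.0%}")
print("features needed for 80%:", features_needed(scores, 0.80))
```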
Table 3 below lists the hyperparameters chosen for optimization and the candidate
values considered for each model:
Table 3. Performance for Initial Models
Model             The Recommended Hyperparameters Before Tuning
DecisionTree      {"max_depth": [5, 10, 15], "min_samples_leaf": [1, 5, 10]}
RandomForest      {"n_estimators": [5, 10, 15], "max_depth": [5, 10, 15],
                   "max_features": [5, 10, 15], "min_samples_split": [2, 5, 10]}
Adaboost          {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]}
GradientBoosting  {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]}
GNB               {}
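Grids such as those in Table 3 can be explored with an exhaustive cross-validated search. The text does not say which search routine was used, so the sketch below assumes scikit-learn's GridSearchCV with 3-fold cross-validation, applied to the RandomForest grid on synthetic data. The synthetic data has 20 features so that every max_features value in the grid is valid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the paper's dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# RandomForest grid from Table 3, written with scikit-learn parameter names.
param_grid = {
    "n_estimators": [5, 10, 15],
    "max_depth": [5, 10, 15],
    "max_features": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # assumed number of folds
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

The grid has 3 × 3 × 3 × 3 = 81 combinations, so with 3 folds the search trains 243 models plus one final refit on the full training split.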
4.3 Performance Evaluation with Optimized Hyperparameters
After hyperparameter tuning, the models were re-evaluated to measure the effect of the
optimization. Comparing the new confusion matrices with the baseline shows the
transition from the initial state to the enhanced state for each model.
The confusion matrices below (Fig. 6a to Fig. 6e) compare the true labels with the
predicted labels, giving the counts of true/false positives and true/false negatives
for each model:
Fig. 6. a. Confusion Matrix for DecisionTree. b. Confusion Matrix for RandomForest.
c. Confusion Matrix for Adaboost. d. Confusion Matrix for GradientBoosting.
e. Confusion Matrix for GNB
In Table 4 below, we provide a summary of the best hyperparameters for each model
after tuning:
5 Discussion
The study focused on evaluating machine learning algorithms for detecting malware
using ROC curves. DecisionTree, RandomForest, and GradientBoosting showed perfect
true positive rates but might be prone to overfitting or dataset specificity. Adaboost and
Gaussian Naive Bayes (GNB) lacked ROC curves due to their complexity. Adaboost’s
sensitivity to learning rate affected its true positive rate, while GNB’s assumption of fea-
ture independence led to a 0.0 rate. Initial accuracies varied from 49% to 99%, improving
after hyperparameter tuning. GNB remained at 50% accuracy, indicating its unsuitability
for complex tasks. Recommended hyperparameters enhanced Random Forest accuracy
to 99.91%. Continuous testing against new files showed sustained performance and
incremental improvements post-optimization. The research highlights the importance
of careful algorithm selection and hyperparameter tuning for optimal performance in
malware detection.
6 Conclusion
In this study, a malware detection system was built using machine learning techniques,
with a focus on the roles of data pre-processing, feature selection, and model selection
through hyperparameter tuning. The research underscored the importance of data balancing
and data engineering methods for ensuring model integrity. The algorithms were evaluated
on metrics such as accuracy, F1-score, recall, precision, and support. In the
experiments, the Random Forest algorithm performed best, achieving an accuracy of
99.91%, which highlights its potential for detecting malware.
While the study’s results are promising, it is important to recognize that optimal
hyperparameters may vary depending on the dataset and algorithm used. Therefore,
customizing hyperparameters for each specific problem is recommended. Additionally,
the research stressed the critical importance of continuous evaluation against new, unseen
files, as high accuracy alone does not ensure comprehensive detection and prevention of
malicious software.
References
1. Radwan, A.M.: Machine learning techniques to detect maliciousness of portable executable
files. In: Proceedings - 2019 International Conference on Promising Electronic Technologies,
ICPET 2019 (2019). https://doi.org/10.1109/ICPET.2019.00023
2. Maleh, Y.: Malware Classification and Analysis Using Convolutional and Recurrent Neural
Network (2019). https://doi.org/10.4018/978-1-5225-7862-8.ch014
3. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and
performance assessment of statistical and machine-learning algorithms using spatial data.
Ecol. Model. 406 (2019). https://doi.org/10.1016/j.ecolmodel.2019.06.002
4. Mallik, A., Khetarpal, A., Kumar, S.: ConRec: malware classification using convolutional
recurrence. J. Comput. Virol. Hack. Tech. 18(4) (2022). https://doi.org/10.1007/s11416-022-00416-3
5. Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware classification
with recurrent networks. In: ICASSP, IEEE International Conference on Acoustics, Speech
and Signal Processing – Proceedings (2015). https://doi.org/10.1109/ICASSP.2015.7178304
6. Chen, Y.H., Lin, S.C., Huang, S.C., Lei, C.L., Huang, C.Y.: Guided malware sample analysis
based on graph neural networks. IEEE Trans. Inf. Forensics Secur. 18 (2023).
https://doi.org/10.1109/TIFS.2023.3283913
7. Ucci, D., Aniello, L., Baldoni, R.: Survey of machine learning techniques for malware
analysis. Comput. Secur. 81 (2019). https://doi.org/10.1016/j.cose.2018.11.001
8. Chouchane, M.R., Lakhotia, A.: Using engine signature to detect metamorphic malware. In:
Proceedings of the 4th ACM Workshop on Recurring Malcode, WORM 2006. Co-Located
with the 13th ACM Conference on Computer and Communications Security, CCS 2006
(2006). https://doi.org/10.1145/1179542.1179558
9. Khodamoradi, P., Fazlali, M., Mardukhi, F., Nosrati, M.: Heuristic metamorphic malware
detection based on statistics of assembly instructions using classification algorithms. In: 18th
CSI International Symposium on Computer Architecture and Digital Systems, CADS 2015
(2016). https://doi.org/10.1109/CADS.2015.7377792
10. Kapratwar, A., Di Troia, F., Stamp, M.: Static and dynamic analysis of android malware.
In: ICISSP 2017 - Proceedings of the 3rd International Conference on Information Systems
Security and Privacy (2017). https://doi.org/10.5220/0006256706530662
11. Li, J., Sun, L., Yan, Q., Li, Z., Srisa-An, W., Ye, H.: Significant permission identification for
machine-learning-based android malware detection. IEEE Trans. Ind. Inform. 14(7) (2018).
https://doi.org/10.1109/TII.2017.2789219
12. Singh, P., Borgohain, S.K., Kumar, J.: Performance enhancement of SVM-based ml malware
detection model using data preprocessing. In: 2022 2nd International Conference on Emerging
Frontiers in Electrical and Electronic Technologies, ICEFEET 2022 (2022).
https://doi.org/10.1109/ICEFEET51821.2022.9848192
13. Yang, L., Liu, J.: TuningMalconv: malware detection with not just raw bytes. IEEE Access 8
(2020). https://doi.org/10.1109/ACCESS.2020.3014245
14. Karampudi, B., Phanideep, D.M., Reddy, V.M.K., Subhashini, N., Muthulakshmi, S.: Malware
analysis using machine learning. In: Abraham, A., Pllana, S., Casalino, G., Ma, K., Bajaj, A.
(eds.) ISDA 2022. LNNS, vol. 716, pp. 281–290. Springer, Cham (2023).
https://doi.org/10.1007/978-3-031-35501-1_28
15. Jin, B., Choi, J., Hong, J.B., Kim, H.: On the effectiveness of perturbations in generating
evasive malware variants. IEEE Access 11 (2023). https://doi.org/10.1109/ACCESS.2023.3262265
16. Zhu, S., Zhang, Z., Yang, L., Song, L., Wang, G.: Benchmarking label dynamics of VirusTotal
engines. In: Proceedings of the ACM Conference on Computer and Communications Security
(2020). https://doi.org/10.1145/3372297.3420013
17. Maulat Nasri, N.N.: Android malware detection system using machine learning. Int. J. Adv.
Trends Comput. Sci. Eng. 9(1.5) (2020). https://doi.org/10.30534/ijatcse/2020/4691.52020
18. Akhtar, M.S., Feng, T.: Malware analysis and detection using machine learning algorithms.
Symmetry (Basel) 14(11) (2022). https://doi.org/10.3390/sym14112304
19. Jacobs, I.S., Bean, C.P.: Fine particles, thin films and exchange anisotropy (effects of finite
dimensions and interfaces on the basic properties of ferromagnets). In: Spin Arrangements
and Crystal Structure, Domains, and Micromagnetics (1963).
https://doi.org/10.1016/b978-0-12-575303-6.50013-0
20. Carlin, D., O’Kane, P., Sezer, S.: A cost analysis of machine learning using dynamic runtime
opcodes for malware detection. Comput. Secur. 85 (2019).
https://doi.org/10.1016/j.cose.2019.04.018
21. On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Philos.
Trans. Roy. Soc. Lond. Ser. A Math. Phys. Sci. 247(935) (1955).
https://doi.org/10.1098/rsta.1955.0005
22. Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction,
selection and fusion for effective malware family classification. In: CODASPY 2016 -
Proceedings of the 6th ACM Conference on Data and Application Security and Privacy (2016).
https://doi.org/10.1145/2857705.2857713
23. Kramer, O.: Scikit-Learn. In: Studies in Big Data, vol. 20 (2016).
https://doi.org/10.1007/978-3-319-33383-0_5