Machine Learning Techniques for Enhanced Malware Detection in Portable
Executable Files
Chapter · May 2025
DOI: 10.1007/978-3-031-88653-9_74
Walid El Mouhtadi(B), Yassine Maleh, and Soufyane Mounir
LaSTI Laboratory, National School of Applied Sciences, Sultan Moulay Slimane University,
Khouribga, Morocco
elmouhtadiwalid@gmail.com, Yassine.maleh@ieee.org,
s.mounir@usms.ma
Abstract. This study explores the strengths and weaknesses of diverse machine-learning
approaches to malware classification, with a specific focus on portable executable (PE)
files. Common machine-learning challenges, such as overfitting and underfitting, are
addressed through ensemble methods and preprocessing techniques, including feature
selection and hyperparameter tuning. The main goal is to improve the performance of
classifiers in distinguishing between malicious and benign PE files. A comparative
analysis of machine learning methods such as random forests, decision trees, and gradient
boosting shows that the random forest algorithm performs best, achieving an accuracy of
99%. By assessing the merits and drawbacks of each algorithm, the study provides insights
into handling diverse malware categories. The paper underscores the importance of ensemble
methods, feature engineering, and preprocessing in enhancing classifier performance for
malware classification, particularly in the context of portable executable files.
Keywords: Malware Detection · Machine Learning · Optimization ·
Hyperparameter Tuning · Data Balancing · Feature Selection
1 Introduction
The rapid proliferation [1, 2] of malicious software poses significant challenges for
conventional signature-based detection systems. Machine learning algorithms offer a
promising solution by leveraging computational intelligence to analyze extensive data
and identify previously unseen malware variants [3]. However, the success of machine
learning in malware analysis depends on factors such as dataset quality, feature selection,
algorithm choice, and hyperparameter tuning [4]. The main research questions are:
• How well does our new ML method using PE file analysis compare to other methods?
• Can balancing the data improve ML models’ accuracy in detecting malware?
• How do pre-processing and feature selection for PE files affect ML algorithms’
performance in detecting malware?
• How can adjusting hyperparameters improve ML algorithms’ ability to detect
malware?
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025
M. Ben Ahmed et al. (Eds.): SCA 2024, LNNS 1310, pp. 784–796, 2025.
https://doi.org/10.1007/978-3-031-88653-9_74
This research delves into the application of machine learning for malware classification,
focusing specifically on Portable Executable (PE) files. With a dataset of
96,765 malware samples and 41,323 legitimate samples, we address class imbalance through
data balancing and reduce complexity through feature selection. Five machine learning
algorithms, including random forests and gradient boosting, were trained and tested, with
hyperparameter tuning improving accuracy.
The primary contributions of this research include a thorough investigation into
machine learning for malware classification, particularly targeting PE files, and the
utilization of feature selection techniques to enhance model performance. The study
evaluates algorithms based on accuracy, precision, and performance on unseen files,
emphasizing the importance of effective pre-processing techniques.
The study aims to deepen our understanding of machine learning’s potential and lim-
itations in malware detection, addressing the effectiveness of data balancing, the impact
of pre-processing on portable executable files, and hyperparameter tuning. The paper
outlines an effective malware detection system, highlighting the significance of data
pre-processing, feature selection, and model selection. It introduces a novel perspective
by identifying optimal initial hyperparameter values, serving as a reference for future
research.
The paper will be organized as follows: Sect. 2 will review related works in the
field. Section 3 will detail the data and methods employed, including subsections on
data collection and model optimization, as well as data balancing, splitting, and feature
selection. Section 4 will present the results obtained from the analysis and subsequent
discussions on these findings. Following the results, Sect. 5 will provide further discus-
sion on the implications and significance of the results. Finally, Sect. 6 will conclude
the paper, summarizing the key findings and suggesting avenues for future research.
2 Related Works
Malware detection is a challenging task in cybersecurity [5–7], and various approaches
have been proposed to tackle this issue [8, 9]. Researchers have recently explored dif-
ferent methods, such as machine learning algorithms [10], signature-based protection,
heuristic-based methods and behavior-based methods [11]. However, previous surveys
have limitations regarding coverage of new research trends and the taxonomy of feature
types used for malware detection [12].
Kapratwar et al. [10] integrate static and dynamic analysis to understand Android
malware behavior comprehensively. Static analysis involves extracting permissions from
the manifest file, providing valuable insights into the app’s authorized actions. Dynamic
analysis, on the other hand, captures system calls during runtime, enabling a deeper
understanding of the code’s execution paths.
Karampudi et al. [14] highlight the limitations of static techniques, such as the decision
tree algorithm, in detecting malware. A hybrid approach is proposed to overcome
these constraints, integrating multiple machine learning (ML) algorithms using ensemble
learning techniques. The AdaBoost classifier is implemented to enhance the efficiency
of the hybrid model. The study evaluates the performance of various ML algorithms,
including Decision Tree, Gaussian Naive Bayes, Random Forest, and Linear SVM, for
analysing malware content. Algorithms with lower accuracy are boosted using AdaBoost
to improve performance, resulting in significant accuracy improvements. The study
concludes that the hybrid approach, combined with ensemble learning, yields more effective
and accurate results in malware detection.
Li et al. [11] addressed the alarming growth rate of malicious apps in the mobile
ecosystem, particularly on the Android platform. The authors introduce SigPID, a malware
detection system based on permission usage analysis. The evaluation shows that
using only 22 significant permissions, SigPID achieves high precision, recall, accuracy,
and F-measure, similar to the baseline approach analyzing all permissions. Moreover,
SigPID demonstrates superior effectiveness to other state-of-the-art approaches,
detecting a high percentage of malware in the dataset, including unknown/new malware
samples.
Singh et al. [12] conducted a study exploring the application of a support-vector-machine
(SVM) [17] based machine learning (ML) model for malware detection in
systems. The primary objective of their research was to improve the SVM-ML model’s
performance through data pre-processing. To achieve this, the researchers employed
four distinct data pre-processing techniques [18]. These techniques were applied to
a well-established dataset known as the Portable Executable Header (PEH) [19, 20]
classification of malware (CLaMP) dataset.
3 Data and Methods
3.1 Data Collection and Model Optimization
In this research, a robust and efficient data processing and model optimization workflow
has been implemented to address this project’s specific challenges and requirements.
The approach incorporates advanced techniques tailored to tackle the issues associated
with unbalanced data and a large, diverse feature set, which can often result in complex
models prone to over-fitting.
The following schema (Fig. 1) outlines the comprehensive Data Processing and
Model Optimization Workflow developed for efficient processing and model enhance-
ment:
Fig. 1. Data Processing and Model Optimization Workflow
A high-quality dataset is crucial for obtaining precise outcomes [15].
This study compiled an extensive dataset comprising 138,042 instances
with 57 distinct features. Of these, roughly 96,000 instances represented malicious files,
while roughly 41,000 were legitimate. The dataset was obtained from the Kaggle platform
and organized in an Excel file. It covers diverse malware categories, including worms,
Trojan horses, ransomware, spyware, adware, keyloggers, rootkits, botnets, fileless
malware, and macro viruses, collected from 2017 to 2023.
During dataset refinement, 54 portable executable characteristics were scrutinized.
The ExtraTreesClassifier algorithm identified 14 crucial features, chosen to limit
underfitting, overfitting, and bias in subsequent analyses.
To assess model efficiency on unseen data, legitimate and malicious files were collected
from various online platforms, namely VirusShare, MalwareBazaar, and VirusTotal.
Drawing on several sources made the evaluation more comprehensive.
Addressing class imbalance, a Data Rectification technique adjusted class instances
for a representative distribution [21, 22]. Feature selection reduced dataset dimension-
ality for more efficient machine learning algorithms.
Hyperparameter tuning optimized model performance by systematically adjusting
algorithm-specific parameters. Trained models underwent evaluation using performance
metrics like accuracy, precision, recall, and F1-score.
Saved as files for practical use, the models allow easy deployment in real-world
scenarios. Evaluating model performance on a separate set of PE files provided insights
into their effectiveness and generalization capability.
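The save-and-reuse step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes joblib is used for serialization, and the synthetic data and the file name rf_pe_model.joblib are hypothetical stand-ins for the 14 selected PE features and the real model file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 14 selected PE features (assumption, not the paper's data).
X, y = make_classification(n_samples=400, n_features=14, random_state=0)

# Train on the first 300 rows and keep the last 100 rows aside as "unseen" files.
model = RandomForestClassifier(n_estimators=15, random_state=0)
model.fit(X[:300], y[:300])

# Save the trained model to disk, then load it back as a deployment would.
model_path = os.path.join(tempfile.mkdtemp(), "rf_pe_model.joblib")  # hypothetical file name
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original model's predictions exactly.
unseen_pred = restored.predict(X[300:])
same_as_original = bool((unseen_pred == model.predict(X[300:])).all())
print("Predictions match after reload:", same_as_original)
```

In practice the unseen rows would come from feature extraction on newly collected PE files rather than from a slice of the training source.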
Through systematic implementation and evaluation, this research aimed to enhance
malware detection accuracy and efficiency. The presented findings and methodologies
contribute to advancing malware analysis, offering valuable insights for cybersecurity
applications and future research. See Fig. 2 for an illustration of the advanced architecture
designed for accurate file classification.
Fig. 2. Advanced Architecture for Accurate Classification of Unseen Files
3.2 Data Balancing, Splitting and Feature Selection
This step ensures that the data used for training the model is balanced, meaning each
class has an equal number of samples. This is important because imbalanced data can
lead to a model biased toward the majority class.
Fig. 3 and Fig. 4 show the percentage of samples in each class
before and after balancing the data:
Fig. 3. Before Balancing
Fig. 4. After Balancing
• Before balancing: 138,088 samples (96,765 malicious and 41,323 legitimate samples).
• After balancing: 82,646 samples in total, 41,323 for each class (41,323 malicious and
41,323 legitimate samples).
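The class counts above are consistent with random undersampling of the majority class, although the exact balancing method is not specified. The sketch below assumes that technique. The function name undersample and the single placeholder feature column are hypothetical; only the class counts are taken from the text.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many samples as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(kept)  # avoid leaving the rows grouped by class
    return X[kept], y[kept]

# Labels with the class counts reported above (1 = malicious, 0 = legitimate).
# The single random feature column is a placeholder, not real PE data.
y = np.concatenate([np.ones(96765, dtype=int), np.zeros(41323, dtype=int)])
X = np.random.default_rng(1).normal(size=(y.size, 1))

X_bal, y_bal = undersample(X, y)
print(np.bincount(y_bal))  # → [41323 41323]
```

Undersampling discards majority-class rows, so it trades some training data for a balanced class distribution.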
– Data Splitting:
Techniques like ExtraTreesClassifier use feature importances derived from decision
trees to rank features by their contribution to the model’s predictive
accuracy.
Let X be the input feature matrix of shape (n_samples, n_features) and y be the target
vector of shape (n_samples,), where:
– n_samples is the number of samples in the dataset.
– n_features is the number of features in each sample.
The ExtraTreesClassifier fits a set of decision trees T1, T2, …, Tn on different
random subsamples of X and y, and then aggregates their predictions through averaging.
– Feature Importance:
The importance of each feature is read from the
extratrees.feature_importances_ attribute, which provides the feature importance scores
calculated by the ExtraTreesClassifier.
– Printing Features:
A loop then prints the top features identified by the ExtraTreesClassifier along with
their corresponding importance scores.
The algorithm helped identify the most important features for accurate classification
by analyzing many potential features. As a result, 14 features were pinpointed as being
the most useful for determining whether a file was malicious. By using these 14 fea-
tures, the classification task achieved high accuracy. This highlights the effectiveness of
ExtraTreesClassifier in selecting the required features for classifying files.
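The feature-ranking steps described above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions: the data are synthetic, the feature names (feature_0, feature_1, ...) are hypothetical, and n_estimators=100 is an arbitrary choice. Only the counts of 54 candidate features and 14 selected features come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in with 54 candidate features (assumption, not the paper's data).
X, y = make_classification(n_samples=1000, n_features=54, n_informative=14, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Fit the ensemble of randomized trees and read the importance scores.
extratrees = ExtraTreesClassifier(n_estimators=100, random_state=0)
extratrees.fit(X, y)
importances = extratrees.feature_importances_

# Rank features from most to least important and keep the top 14.
order = np.argsort(importances)[::-1]
top_k = 14
top_indices = order[:top_k]

for rank, idx in enumerate(top_indices, start=1):
    print(f"{rank:2d}. {feature_names[idx]:<12} {importances[idx]:.4f}")

# Reduced feature matrix passed on to the classifiers.
X_selected = X[:, top_indices]
```

The importance scores are normalized to sum to 1, so each value can be read as a share of the total importance.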
The visualization in Fig. 5 displays the 14 most important features identified
by the ExtraTreesClassifier for accurate classification:
4 Results and Discussion
4.1 Performance Evaluation
The experiments were conducted on a computer with an Intel Core i7 CPU, 16 GB RAM,
and Intel® UHD GPU with 4.1 GB dedicated video memory. The setup used Windows
10, Python 3.8.5, and various libraries. The data were stored locally and used to train
and evaluate five machine learning algorithms: Random Forest, Decision Tree,
Adaboost, Gradient Boosting, and Gaussian Naive Bayes (GNB). The comparison
parameters include True Positive Rate (TPR), False Positive
Rate (FPR), Precision, and Accuracy.
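The comparison parameters listed above follow directly from the entries of a binary confusion matrix. The sketch below shows one way to compute them. The helper name binary_rates and the ten example labels are hypothetical, and label 1 is assumed to denote a malicious file.

```python
from sklearn.metrics import confusion_matrix

def binary_rates(y_true, y_pred):
    """Return accuracy, precision, TPR, FPR and FNR for labels 0 (legitimate) and 1 (malicious)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),   # share of flagged files that are truly malicious
        "tpr": tp / (tp + fn),         # share of malicious files that are detected
        "fpr": fp / (fp + tn),         # share of legitimate files that are wrongly flagged
        "fnr": fn / (fn + tp),         # share of malicious files that are missed
    }

# Tiny hypothetical example: 4 malicious and 6 legitimate files.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
rates = binary_rates(y_true, y_pred)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

By definition TPR + FNR = 1. The precision is undefined when a model flags no file as malicious, a case this sketch does not guard against.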
Fig. 5. Importance of the relevant features
Before delving into the fine-tuning process, let’s first explore how our models per-
formed with their initial, out-of-the-box configurations. The confusion matrices unveiled
at this stage will serve as our benchmark, providing a baseline for the subsequent
enhancements.
Table 2 below displays the values of performance criteria for every algorithm that
was tested:
Table 2. Criteria for Initial Models
Algorithm         Accuracy    Precision  TPR    FNR
DecisionTree      0.99062311  0.988      0.992  0.012
RandomForest      0.99165154  0.992      0.995  0.008
Adaboost          0.98324258  0.984      0.982  0.016
GradientBoosting  0.98505747  0.987      0.987  0.013
GNB               0.40505747  0.477      0.478  0.011
4.2 Optimized Hyperparameters for Various Models
The hyperparameters chosen for each model were carefully selected to strike a balance
between bias and variance, aiming to prevent overfitting while capturing the complexity
of the data. These choices were made through a thorough analysis of data characteristics,
iterative testing, and a deep understanding of each hyperparameter.
For example, the candidate values for max_features and n_estimators extend up to 15
so that the search covers all 14 essential features. Low values
for min_samples_split (2, 5, 10) were chosen after extensive testing to enhance accuracy.
The selection of max_depth values (5, 10, 15) was influenced by the feature selection
process, prioritizing precision and simplicity with the 14 chosen features.
Note that relying on only the 2 most important features would capture just 34% of the
total feature importance, as depicted in Fig. 5. By incorporating
at least 5 features, decisions were based on at least 48% of the total feature importance
and feature variance, ensuring a more comprehensive understanding of the dataset.
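The reasoning above amounts to a cumulative sum over the ranked importance scores. The sketch below illustrates it with hypothetical scores, chosen only so that the top 2 features sum to 0.34 and the top 5 to 0.48 as in the text. The actual per-feature values are those plotted in Fig. 5 and are not reproduced here. The helper name features_needed is also hypothetical.

```python
import numpy as np

# Hypothetical importance scores for the 14 selected features (illustrative values only).
importances = np.array([0.18, 0.16, 0.05, 0.05, 0.04, 0.04, 0.04,
                        0.03, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02])

# Cumulative share of total importance covered by the top-k features.
ranked = np.sort(importances)[::-1]
cumulative = np.cumsum(ranked)
print(np.round(cumulative[:5], 2))

def features_needed(scores, target_share):
    """Smallest number of top-ranked features whose importances reach target_share."""
    running = np.cumsum(np.sort(scores)[::-1])
    if running[-1] < target_share:
        raise ValueError("target_share exceeds the total importance available")
    return int(np.argmax(running >= target_share) + 1)

print(features_needed(importances, 0.30))  # → 2
print(features_needed(importances, 0.45))  # → 5
```

A curve of this kind makes the trade-off explicit: each additional feature adds less importance, so the lower bounds of the search ranges can be tied to a minimum covered share.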
Table 3 below lists the candidate hyperparameter values considered for each model
before tuning:
Table 3. Recommended Hyperparameter Values Before Tuning
Model             Recommended Hyperparameters Before Tuning
DecisionTree      {"max_depth": [5, 10, 15], "min_samples_leaf": [1, 5, 10]}
RandomForest      {"n_estimators": [5, 10, 15], "max_depth": [5, 10, 15],
                   "max_features": [5, 10, 15], "min_samples_split": [2, 5, 10]}
Adaboost          {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]}
GradientBoosting  {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]}
GNB               {}
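A search over candidate values such as those in Table 3 could be run as sketched below. The search procedure itself is not specified, so this sketch assumes an exhaustive grid search with 3-fold cross-validation via scikit-learn's GridSearchCV. The data are synthetic, and only the DecisionTree and GradientBoosting grids are shown to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced dataset with 14 selected features (assumption).
X, y = make_classification(n_samples=600, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate values taken from Table 3.
search_spaces = {
    "DecisionTree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [5, 10, 15], "min_samples_leaf": [1, 5, 10]},
    ),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]},
    ),
}

best_params = {}
test_accuracy = {}
for name, (estimator, grid) in search_spaces.items():
    # Try every combination in the grid and keep the one with the best CV accuracy.
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=3)
    search.fit(X_train, y_train)
    best_params[name] = search.best_params_
    # score() uses the best estimator refitted on the full training split.
    test_accuracy[name] = search.score(X_test, y_test)
    print(name, best_params[name], round(test_accuracy[name], 3))
```

Scoring the refitted estimator on a held-out split, as done here, keeps the reported accuracy separate from the data used to choose the hyperparameters.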
4.3 Performance Evaluation with Optimized Hyperparameters
After hyperparameter tuning, we re-evaluated the models to measure the effect of the
optimization. The resulting confusion matrices show the transition from the initial
state to the tuned state for each model.
The confusion matrices below (Fig. 6a to Fig. 6e) compare the true labels with the
predicted labels, showing the true/false positives and true/false negatives for each model:
Fig. 6. a. Confusion Matrix for DecisionTree. b. Confusion Matrix for RandomForest.
c. Confusion Matrix for Adaboost. d. Confusion Matrix for GradientBoosting.
e. Confusion Matrix for GNB
In Table 4 below, we provide a summary of the best hyperparameters for each model
after tuning:
Table 4. Best Hyperparameters for Models
Model             Selected Hyperparameters After Tuning
DecisionTree      {"max_depth": 10, "min_samples_leaf": 1}
RandomForest      {"n_estimators": 15, "max_depth": 15, "max_features": 5,
                   "min_samples_split": 2}
Adaboost          {"n_estimators": 15, "learning_rate": 0.5}
GradientBoosting  {"n_estimators": 15, "learning_rate": 1.0}
GNB               {}
Table 5 displays the values of performance criteria for every algorithm after tuning
(Fig. 7):
Table 5. Performance Criteria After Tuning
Algorithm         Accuracy   Precision  True Positive Rate  False Negative Rate
DecisionTree      0.994007   0.988      0.992               0.012
RandomForest      0.999143   0.992      0.995               0.008
Adaboost          0.99776/7  0.984      0.982               0.016
GradientBoosting  0.990260   0.987      0.987               0.013
GNB               0.504597   1.000      0.000               0.000
Fig. 7. ROC Curve Graph of models
5 Discussion
The study focused on evaluating machine learning algorithms for detecting malware
using ROC curves. DecisionTree, RandomForest, and GradientBoosting showed perfect
true positive rates but might be prone to overfitting or dataset specificity. Adaboost and
Gaussian Naive Bayes (GNB) lacked ROC curves due to their complexity. Adaboost’s
sensitivity to the learning rate affected its true positive rate, while GNB’s assumption of
feature independence led to a true positive rate of 0.0. Initial accuracies varied from 49%
to 99% and improved after hyperparameter tuning. GNB remained at about 50% accuracy,
indicating its unsuitability for this task. The recommended hyperparameters raised
Random Forest accuracy to 99.91%. Continuous testing against new files showed sustained
performance and incremental improvements post-optimization. The research highlights
the importance of careful algorithm selection and hyperparameter tuning for optimal
performance in malware detection.
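For reference, an ROC curve of the kind discussed above can be computed from a model's predicted probabilities as sketched below. The data are synthetic, so the numbers will not match those reported here. The Random Forest settings are the ones listed in Table 4, and everything else is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataset with 14 selected features (assumption).
X, y = make_classification(n_samples=600, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random Forest with the hyperparameters selected in Table 4.
model = RandomForestClassifier(
    n_estimators=15, max_depth=15, max_features=5, min_samples_split=2, random_state=0
)
model.fit(X_train, y_train)

# ROC analysis needs a continuous score, here the predicted probability of class 1.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}, ROC points = {len(fpr)}")
```

Each (fpr, tpr) pair corresponds to one decision threshold, so the curve shows how the detection rate trades off against false alarms rather than a single operating point.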
6 Conclusion
In this study, a malware detection system was created using machine learning tech-
niques, with a focus on the crucial roles of data pre-processing, feature selection, and
model selection through Hyperparameter Tuning. The research underscored the signif-
icance of implementing data balancing and data engineering methods to ensure model
integrity. Notably, various algorithms were thoroughly evaluated based on metrics such
as accuracy, F1-score, recall, precision, and support. The experimental results unequiv-
ocally showcased the superior performance of the Random Forest algorithm, achieving
an impressive accuracy rate of 99.91%. This highlights the potential of Random Forest
in detecting malware.
While the study’s results are promising, it is important to recognize that optimal
hyperparameters may vary depending on the dataset and algorithm used. Therefore,
customizing hyperparameters for each specific problem is recommended. Additionally,
the research stressed the critical importance of continuous evaluation against new, unseen
files, as high accuracy alone does not ensure comprehensive detection and prevention of
malicious software.
References
1. Radwan, A.M.: Machine learning techniques to detect maliciousness of portable executable
files. In: Proceedings - 2019 International Conference on Promising Electronic Technologies,
ICPET 2019 (2019). https://doi.org/10.1109/ICPET.2019.00023
2. Maleh, Y.: Malware Classification and Analysis Using Convolutional and Recurrent Neural
Network (2019). https://doi.org/10.4018/978-1-5225-7862-8.ch014
3. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and
performance assessment of statistical and machine-learning algorithms using spatial data.
Ecol. Model. 406 (2019). https://doi.org/10.1016/j.ecolmodel.2019.06.002
4. Mallik, A., Khetarpal, A., Kumar, S.: ConRec: malware classification using convolutional
recurrence. J. Comput. Virol. Hack. Tech. 18(4) (2022). https://doi.org/10.1007/s11416-022-
00416-3
5. Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware classification
with recurrent networks. In: ICASSP, IEEE International Conference on Acoustics, Speech
and Signal Processing – Proceedings (2015). https://doi.org/10.1109/ICASSP.2015.7178304
6. Chen, Y.H., Lin, S.C., Huang, S.C., Lei, C.L., Huang, C.Y.: Guided malware sample analysis
based on graph neural networks. IEEE Trans. Inf. Forensics Secur. 18 (2023). https://doi.org/
10.1109/TIFS.2023.3283913
7. Ucci, D., Aniello, L., Baldoni, R.: Survey of machine learning techniques for malware
analysis. Comput. Secur. 81 (2019). https://doi.org/10.1016/j.cose.2018.11.001
8. Chouchane, M.R., Lakhotia, A.: Using engine signature to detect metamorphic malware. In:
Proceedings of the 4th ACM Workshop on Recurring Malcode, WORM 2006. Co-Located
with the 13th ACM Conference on Computer and Communications Security, CCS 2006
(2006). https://doi.org/10.1145/1179542.1179558
9. Khodamoradi, P., Fazlali, M., Mardukhi, F., Nosrati, M.: Heuristic metamorphic malware
detection based on statistics of assembly instructions using classification algorithms. In: 18th
CSI International Symposium on Computer Architecture and Digital Systems, CADS 2015
(2016). https://doi.org/10.1109/CADS.2015.7377792
10. Kapratwar, A., Di Troia, F., Stamp, M.: Static and dynamic analysis of android malware.
In: ICISSP 2017 - Proceedings of the 3rd International Conference on Information Systems
Security and Privacy (2017). https://doi.org/10.5220/0006256706530662
11. Li, J., Sun, L., Yan, Q., Li, Z., Srisa-An, W., Ye, H.: Significant permission identification for
machine-learning-based android malware detection. IEEE Trans. Ind. Inform. 14(7) (2018).
https://doi.org/10.1109/TII.2017.2789219
12. Singh, P., Borgohain, S.K., Kumar, J.: Performance enhancement of SVM-based ml malware
detection model using data preprocessing. In: 2022 2nd International Conference on Emerging
Frontiers in Electrical and Electronic Technologies, ICEFEET 2022 (2022). https://doi.org/
10.1109/ICEFEET51821.2022.9848192
13. Yang, L., Liu, J.: TuningMalconv: malware detection with not just raw bytes. IEEE Access 8
(2020). https://doi.org/10.1109/ACCESS.2020.3014245
14. Karampudi, B., Phanideep, D.M., Reddy, V.M.K., Subhashini, N., Muthulakshmi, S.: Malware
analysis using machine learning. In: Abraham, A., Pllana, S., Casalino, G., Ma, K., Bajaj, A.
(eds.) ISDA 2022. LNNS, vol. 716, pp. 281–290. Springer, Cham (2023).
https://doi.org/10.1007/978-3-031-35501-1_28
15. Jin, B., Choi, J., Hong, J.B., Kim, H.: On the effectiveness of perturbations in generating
evasive malware variants. IEEE Access 11 (2023). https://doi.org/10.1109/ACCESS.2023.
3262265
16. Zhu, S., Zhang, Z., Yang, L., Song, L., Wang, G.: Benchmarking label dynamics of VirusTotal
engines. In: Proceedings of the ACM Conference on Computer and Communications Security
(2020). https://doi.org/10.1145/3372297.3420013
17. Maulat Nasri, N.N.: Android malware detection system using machine learning. Int. J. Adv.
Trends Comput. Sci. Eng. 9(1.5) (2020). https://doi.org/10.30534/ijatcse/2020/4691.52020
18. Akhtar, M.S., Feng, T.: Malware analysis and detection using machine learning algorithms.
Symmetry (Basel) 14(11) (2022). https://doi.org/10.3390/sym14112304
19. Jacobs, I.S., Bean, C.P.: Fine particles, thin films and exchange anisotropy (effects of finite
dimensions and interfaces on the basic properties of ferromagnets). In: Spin Arrangements
and Crystal Structure, Domains, and Micromagnetics (1963). https://doi.org/10.1016/b978-
0-12-575303-6.50013-0
20. Carlin, D., O’Kane, P., Sezer, S.: A cost analysis of machine learning using dynamic runtime
opcodes for malware detection. Comput. Secur. 85 (2019). https://doi.org/10.1016/j.cose.
2019.04.018
21. On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Philos.
Trans. Roy. Soc. Lond. Ser. A Math. Phys. Sci. 247(935) (1955). https://doi.org/10.1098/rsta.
1955.0005
22. Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction,
selection and fusion for effective malware family classification. In: CODASPY 2016 - Pro-
ceedings of the 6th ACM Conference on Data and Application Security and Privacy (2016).
https://doi.org/10.1145/2857705.2857713
23. Kramer, O.: Scikit-Learn. In: Studies in Big Data, vol. 20 (2016). https://doi.org/10.1007/
978-3-319-33383-0_5
View publication stats

978-3-031-88653-9_74aaaaaaaaaaaaaaaaaaaaaaaaaaaaa.pdf

  • 1.
    See discussions, stats,and author profiles for this publication at: https://www.researchgate.net/publication/391644720 Machine Learning Techniques for Enhanced Malware Detection in Portable Executable Files Chapter · May 2025 DOI: 10.1007/978-3-031-88653-9_74 CITATIONS 0 READS 80 3 authors: Walid El Mouhtadi Université Sultan Moulay Slimane 4 PUBLICATIONS 1 CITATION SEE PROFILE Maleh Yassine Université Sultan Moulay Slimane 300 PUBLICATIONS 1,655 CITATIONS SEE PROFILE Soufyane Mounir National school os Applied Sciences, Morocco, Khouribga 33 PUBLICATIONS 248 CITATIONS SEE PROFILE All content following this page was uploaded by Maleh Yassine on 11 May 2025. The user has requested enhancement of the downloaded file.
  • 2.
    Machine Learning Techniquesfor Enhanced Malware Detection in Portable Executable Files Walid El Mouhtadi(B), Yassine Maleh, and Soufyane Mounir LaSTI Laboratory, National School of Applied Sciences, Sultan Moulay Slimane University, Khouribga, Morocco elmouhtadiwalid@gmail.com, Yassine.maleh@ieee.org, s.mounir@usms.ma Abstract. This study aims to explore and showcase the strengths and weak- nesses of diverse machine-learning approaches in classifying malware, with a specific focus on portable executable (PE) files. Overcoming common challenges in machine learning, such as overfitting and underfitting, is addressed through the use of ensemble methods and preprocessing techniques, including feature selec- tion and hyperparameter tuning. The main goal is to improve the performance of classifiers in distinguishing between malicious and benign PE files. Through a comparative analysis of machine learning methods like random forests, decision trees, and gradient boosting, the research emphasizes the superiority of the random forests algorithm, achieving an impressive accuracy rate of 99%. By thoroughly assessing the merits and drawbacks of each algorithm, the study provides valuable insights into effectively managing diverse malware categories. This paper under- scores the importance of ensemble methods, feature engineering, and preprocess- ing in enhancing classifier performance for malware classification, particularly in the context of portable executable files. Keywords: Malware Detection · Machine Learning · Optimization · Hyperparameter Tunning · Data Balancing · Feature Selection 1 Introduction The rapid proliferation [1, 2] of malicious software poses significant challenges for conventional signature-based detection systems. Machine learning algorithms offer a promising solution by leveraging computational intelligence to analyze extensive data and identify previously unseen malware variants [3]. 
However, the success of machine learning in malware analysis depends on factors such as dataset quality, feature selection, algorithm choice, and hyperparameter tuning [4]. The main research questions are: • How well does our new ML method using PE file analysis compare to other methods? • Can balancing the data improve ML models’ accuracy in detecting malware? • How do pre-processing and selecting features for PE files affect ML algorithm’s performance in finding malware? © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025 M. Ben Ahmed et al. (Eds.): SCA 2024, LNNS 1310, pp. 784–796, 2025. https://doi.org/10.1007/978-3-031-88653-9_74
  • 3.
    Machine Learning Techniquesfor Enhanced Malware Detection 785 • How can adjusting hyperparameters improve ML algorithms’ ability to detect malware? This research delves into the application of machine learning for malware clas- sification, focusing specifically on Portable Executable (PE) files. With a dataset of 96,765 malware samples and 41,323 legitimate samples, we address imbalances and complexities by employing feature selection techniques. Five machine learning algo- rithms, including random forests and gradient boosting, were trained and tested, with hyperparameter tuning significantly enhancing accuracy. The primary contributions of this research include a thorough investigation into machine learning for malware classification, particularly targeting PE files, and the utilization of feature selection techniques to enhance model performance. The study evaluates algorithms based on accuracy, precision, and performance on unseen files, emphasizing the importance of effective pre-processing techniques. The study aims to deepen our understanding of machine learning’s potential and lim- itations in malware detection, addressing the effectiveness of data balancing, the impact of pre-processing on portable executable files, and hyperparameter tuning. The paper outlines an effective malware detection system, highlighting the significance of data pre-processing, feature selection, and model selection. It introduces a novel perspective by identifying optimal initial hyperparameter values, serving as a reference for future research. The paper will be organized as follows: Sect. 2 will review related works in the field. Section 3 will detail the data and methods employed, including subsections on data collection and model optimization, as well as data balancing, splitting, and feature selection. Section 4 will present the results obtained from the analysis and subsequent discussions on these findings. Following the results, Sect. 
5 will provide further discus- sion on the implications and significance of the results. Finally, Sect. 6 will conclude the paper, summarizing the key findings and suggesting avenues for future research. 2 Related Works Malware detection is a challenging task in cybersecurity [5–7], and various approaches have been proposed to tackle this issue [8, 9]. Researchers have recently explored dif- ferent methods, such as machine learning algorithms [10], signature-based protection, heuristic-based methods and behavior-based methods [11]. However, previous surveys have limitations regarding coverage of new research trends and the taxonomy of feature types used for malware detection [12]. Kapratwar et al. [13] integrate static and dynamic analysis to understand Android malware behavior comprehensively. Static analysis involves extracting permissions from the manifest file, providing valuable insights into the app’s authorized actions. Dynamic analysis, on the other hand, captures system calls during runtime, enabling a deeper understanding of the code’s execution paths.
Karampudi et al. [14] highlight the limitations of static techniques, such as the decision tree algorithm, in detecting malware. To overcome these constraints, they propose a hybrid approach that integrates multiple machine learning (ML) algorithms using ensemble learning techniques. The AdaBoost classifier is implemented to enhance the efficiency of the hybrid model. The study evaluates the performance of various ML algorithms, including Decision Tree, Gaussian Naive Bayes, Random Forest, and Linear SVM, for analysing malware content. Algorithms with lower accuracy are boosted using AdaBoost, resulting in significant accuracy improvements. The study concludes that the hybrid approach, combined with ensemble learning, yields more effective and accurate results in malware detection.

Li et al. [15] address the rapid growth of malicious apps in the mobile ecosystem, particularly on the Android platform. The authors introduce SigPID, a malware detection system based on permission usage analysis. The evaluation shows that, using only 22 significant permissions, SigPID achieves precision, recall, accuracy, and F-measure similar to the baseline approach that analyzes all permissions. Moreover, SigPID is more effective than other state-of-the-art approaches, detecting a high percentage of malware in the dataset, including unknown/new malware samples.

Singh et al. [16] explored a support-vector-machine (SVM) [17] based machine learning (ML) model for malware detection. The primary objective of their research was to improve the SVM-ML model’s performance through data pre-processing. To this end, the researchers employed four distinct data pre-processing techniques [18]. These techniques were applied to a well-established dataset known as the Portable Executable Header (PEH) [19, 20] classification of malware (CLaMP) dataset.
3 Data and Methods

3.1 Data Collection and Model Optimization

In this research, a data processing and model optimization workflow was implemented to address the project’s specific challenges and requirements. The approach incorporates techniques tailored to unbalanced data and a large, diverse feature set, which can often result in complex models prone to overfitting.

The following schema (Fig. 1) outlines the Data Processing and Model Optimization Workflow developed for efficient processing and model enhancement:
Fig. 1. Data Processing and Model Optimization Workflow

Precise outcomes require a high-quality dataset [15]. This study compiled an extensive dataset comprising 138,042 instances with 57 distinct features. Of these, approximately 96,000 instances represented malicious files, while approximately 41,000 were legitimate. The dataset was obtained from the Kaggle platform and organized in an Excel file, capturing diverse features related to malware characteristics, including Worms, Trojan Horses, Ransomware, Spyware, Adware, Keyloggers, Rootkits, Botnets, Fileless Malware, and Macro Viruses from 2017 to 2023.

During dataset refinement, 54 portable executable characteristics were scrutinized. The ExtraTreesClassifier algorithm identified 14 crucial features, chosen to limit underfitting, overfitting, and bias in subsequent analyses.

To assess model efficiency, legitimate and malicious files were collected from various online platforms. Collecting data from VirusShare, MalwareBazaar, and VirusTotal broadened the coverage of the study and the reliability of its results.

To address class imbalance, a Data Rectification technique adjusted the number of instances per class to obtain a representative distribution [21, 22]. Feature selection reduced dataset dimensionality so that the machine learning algorithms run more efficiently. Hyperparameter tuning optimized model performance by systematically adjusting algorithm-specific parameters. Trained models were evaluated using performance metrics such as accuracy, precision, recall, and F1-score. The models were saved as files, which allows easy deployment in real-world scenarios. Evaluating model performance on a separate set of PE files provided insight into their effectiveness and generalization capability.

Through systematic implementation and evaluation, this research aimed to enhance malware detection accuracy and efficiency. The presented findings and methodologies contribute to malware analysis, offering insights for cybersecurity applications and future research. See Fig. 2 for an illustration of the architecture designed for accurate file classification.
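The workflow described in this subsection can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins generated with scikit-learn, random undersampling is assumed as the balancing step (the text only names a "Data Rectification technique"), and the helper `undersample`, the file name `rf_pe_model.joblib`, and the use of `joblib` for saving the model are hypothetical choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the PE feature table (assumption).
X, y = make_classification(n_samples=3000, n_features=54, n_informative=10,
                           weights=[0.3, 0.7], random_state=0)


def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# 1. Balance the classes, then split into training and held-out data.
X_bal, y_bal = undersample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)

# 2. Rank features with ExtraTreesClassifier and keep the 14 highest-scoring.
ranker = ExtraTreesClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
top14 = np.argsort(ranker.feature_importances_)[::-1][:14]

# 3. Train a classifier on the selected columns only.
model = RandomForestClassifier(n_estimators=15, random_state=0)
model.fit(X_train[:, top14], y_train)

# 4. Save the model with its column selection, then reload it as a
#    deployment step would and score the held-out samples.
path = os.path.join(tempfile.mkdtemp(), "rf_pe_model.joblib")
joblib.dump({"model": model, "columns": top14}, path)
bundle = joblib.load(path)
predictions = bundle["model"].predict(X_test[:, bundle["columns"]])
test_accuracy = accuracy_score(y_test, predictions)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Storing the selected column indices next to the model keeps the feature selection and the classifier in sync when the file is loaded elsewhere.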
Fig. 2. Advanced Architecture for Accurate Classification of Unseen Files

3.2 Data Balancing, Splitting and Feature Selection

This step ensures that the data used for training the model is balanced, meaning each class has an equal number of samples. This is important because imbalanced data can bias the model toward the majority class. The graphs in Fig. 3 and Fig. 4 show the percentage of samples in each class before and after balancing the data:

Fig. 3. Before Balancing

Fig. 4. After Balancing

• Before balancing: 138,088 samples (96,765 Malicious and 41,323 Legitimate samples)
• After balancing: 82,646 samples in total (41,323 Malicious and 41,323 Legitimate samples).

– Data Splitting: Techniques like ExtraTreesClassifier use feature importance derived from decision trees to rank features by their contribution to the model’s predictive accuracy. Let X be the input feature matrix of shape (n_samples, n_features) and y be the target vector of shape (n_samples,), where:
  – n_samples is the number of samples in the dataset.
  – n_features is the number of features in each sample.
  The ExtraTreesClassifier fits a set of decision trees T1, T2, …, Tn on different random subsamples of X and y, and then aggregates their predictions through averaging.
– Feature Importance: The importance of each feature is given by the extratrees.feature_importances_ attribute, which provides the feature importance scores calculated by the ExtraTreesClassifier.
– Printing Features: The loop prints out the top features identified by the ExtraTreesClassifier along with their corresponding importance.

By analyzing many candidate features, the algorithm helped identify those most important for accurate classification. As a result, 14 features were pinpointed as the most useful for determining whether a file was malicious. Using these 14 features, the classification task achieved high accuracy. This highlights the effectiveness of ExtraTreesClassifier in selecting the features required for classifying files.

Fig. 5 displays the 14 most important features identified by the ExtraTreesClassifier for accurate classification.

4 Results and Discussion

4.1 Performance Evaluation

The experiments were conducted on a computer with an Intel Core i7 CPU, 16 GB RAM, and an Intel® UHD GPU with 4.1 GB dedicated video memory. The setup used Windows 10, Python 3.8.5, and various libraries. The data, stored locally, were used to train and evaluate the machine learning algorithms Random Forest, Decision Tree, AdaBoost, Gradient Boosting, and GNB, with insights derived from accuracy assessments. The comparison parameters include True Positive Rate (TPR), False Positive Rate (FPR), Precision, and Accuracy.
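The comparison parameters named above all follow from the four cells of a binary confusion matrix. As a minimal sketch, assuming the malicious class is encoded as 1 and that no denominator is zero, the arithmetic looks as follows; the helper names `confusion_counts` and `rates` and the sample labels are hypothetical and chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Derive TPR, FPR, FNR, precision and accuracy from the four counts.

    Assumes every denominator is non-zero, i.e. both classes occur and at
    least one sample is predicted positive.
    """
    return {
        "tpr": tp / (tp + fn),        # share of malicious files detected
        "fpr": fp / (fp + tn),        # share of legitimate files flagged
        "fnr": fn / (tp + fn),        # share of malicious files missed
        "precision": tp / (tp + fp),  # share of flagged files that are malicious
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical labels: 1 = malicious, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = rates(*counts)
print(counts, metrics)
```

Note that TPR and FNR always sum to 1, so reporting one determines the other.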
Fig. 5. Importance of the relevant features

Before turning to the fine-tuning process, we first examine how the models performed with their initial, out-of-the-box configurations. The confusion matrices obtained at this stage serve as a baseline for the subsequent enhancements.

Table 2 displays the values of the performance criteria for every algorithm that was tested:

Table 2. Criteria for Initial Models

| Algorithm | Accuracy | Precision | TPR | FNR |
|---|---|---|---|---|
| DecisionTree | 0.99062311 | 0.988 | 0.992 | 0.012 |
| RandomForest | 0.99165154 | 0.992 | 0.995 | 0.008 |
| Adaboost | 0.98324258 | 0.984 | 0.982 | 0.016 |
| GradientBoosting | 0.98505747 | 0.987 | 0.987 | 0.013 |
| GNB | 0.40505747 | 0.477 | 0.478 | 0.011 |

4.2 Optimized Hyperparameters for Various Models

The hyperparameters for each model were selected to balance bias and variance, aiming to prevent overfitting while capturing the complexity of the data. These choices were based on an analysis of the data characteristics, iterative testing, and the role of each hyperparameter. For example, the candidate values for max_features and n_estimators both extend to 15 so as to encompass all 14 essential features. Lower values for min_samples_split (2, 5, 10) were chosen after extensive testing to enhance accuracy. The selection of max_depth values (5, 10, 15) was influenced by the feature selection process, prioritizing precision and simplicity with the 14 chosen features.
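A search over the candidate values named above can be sketched with an exhaustive grid search. The text does not state which search procedure was used, so `GridSearchCV`, the 3-fold cross-validation, and the synthetic 14-feature data below are assumptions made for illustration; only the candidate values for max_depth and min_samples_split and the n_estimators value of 15 come from the paragraph above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 14 columns, mirroring the 14 selected features
# (assumption: the real PE data is not reproduced here).
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           random_state=0)

# Candidate values taken from the text; every combination is tried.
param_grid = {
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=15, random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # assumed number of folds
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 4))
```

With three values per hyperparameter the grid holds nine combinations, so the cost grows quickly as further hyperparameters such as max_features are added to the search.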
Note that using only 2 features would rely on just 34% of the total feature importance, as depicted in Fig. 5. By incorporating at least 5 features, decisions were based on at least 48% of the total feature importance and feature variance, giving a more comprehensive view of the dataset.

Table 3 lists the hyperparameter values chosen for optimization:

Table 3. Performance for Initial Models

| Model | The Recommended Hyperparameters Before Tuning |
|---|---|
| DecisionTree | {"max_depth": [5, 10, 15], "min_samples_leaf": [1, 5, 10]} |
| RandomForest | {"n_estimators": [5, 10, 15], "max_depth": [5, 10, 15], "max_features": [5, 10, 15], "min_samples_split": [2, 5, 10]} |
| Adaboost | {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]} |
| GradientBoosting | {"n_estimators": [5, 10, 15], "learning_rate": [0.1, 0.5, 1.0]} |
| GNB | {} |

4.3 Performance Evaluation with Optimized Hyperparameters

After hyperparameter tuning, we re-evaluated the models to assess the effect of the optimization. The resulting confusion matrices show the transition of each model from its initial state to its tuned state. The confusion matrices below (Fig. 6a to Fig. 6e) present the true labels against the predicted labels, including true/false positives and true/false negatives:
Fig. 6. (a) Confusion Matrix for DecisionTree. (b) Confusion Matrix for RandomForest. (c) Confusion Matrix for Adaboost. (d) Confusion Matrix for GradientBoosting. (e) Confusion Matrix for GNB

Table 4 summarizes the best hyperparameters for each model after tuning:
Table 4. Best Hyperparameters for Models

| Model | Selected Hyperparameters After Tuning |
|---|---|
| DecisionTree | {"max_depth": 10, "min_samples_leaf": 1} |
| RandomForest | {"n_estimators": 15, "max_depth": 15, "max_features": 5, "min_samples_split": 2} |
| Adaboost | {"n_estimators": 15, "learning_rate": 0.5} |
| GradientBoosting | {"n_estimators": 15, "learning_rate": 1.0} |
| GNB | {} |

Table 5 displays the values of the performance criteria for every algorithm after tuning (see also Fig. 7):

Table 5. After Tuning

| Algorithm | Accuracy | Precision | True Positive Rate | False Negative Rate |
|---|---|---|---|---|
| DecisionTree | 0.994007 | 0.988 | 0.992 | 0.012 |
| RandomForest | 0.999143 | 0.992 | 0.995 | 0.008 |
| Adaboost | 0.99776/7 | 0.984 | 0.982 | 0.016 |
| GradientBoosting | 0.990260 | 0.987 | 0.987 | 0.013 |
| GNB | 0.504597 | 1.000 | 0.000 | 0.000 |

Fig. 7. ROC Curve Graph of models
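An ROC curve such as those in Fig. 7 plots the true positive rate against the false positive rate as the decision threshold on the predicted score is varied. The sketch below shows one way to obtain the curve points and the area under the curve with scikit-learn. It uses synthetic data and an untuned random forest, so it illustrates the procedure only and does not reproduce the curves or values reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 14 selected PE features (assumption).
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=15, random_state=0)
clf.fit(X_train, y_train)

# ROC analysis needs a continuous score, here the probability of class 1.
scores = clf.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per threshold; the curve runs from (0, 0) to (1, 1).
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc_value = roc_auc_score(y_test, scores)
print(f"{len(thresholds)} thresholds, AUC = {auc_value:.3f}")
```

Because the curve is built from scores rather than hard labels, any classifier that exposes `predict_proba` or `decision_function` can be evaluated this way.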
5 Discussion

The study evaluated machine learning algorithms for malware detection using ROC curves. DecisionTree, RandomForest, and GradientBoosting showed perfect true positive rates but might be prone to overfitting or dataset specificity. Adaboost and Gaussian Naive Bayes (GNB) lacked ROC curves due to their complexity. Adaboost’s sensitivity to the learning rate affected its true positive rate, while GNB’s assumption of feature independence led to a true positive rate of 0.0. Initial accuracies varied from 49% to 99% and improved after hyperparameter tuning. GNB remained at about 50% accuracy, indicating its unsuitability for complex tasks. The recommended hyperparameters raised Random Forest accuracy to 99.91%. Continuous testing against new files showed sustained performance and incremental improvements after optimization. The research highlights the importance of careful algorithm selection and hyperparameter tuning for optimal performance in malware detection.

6 Conclusion

In this study, a malware detection system was created using machine learning techniques, with a focus on the crucial roles of data pre-processing, feature selection, and model selection through hyperparameter tuning. The research underscored the significance of data balancing and data engineering methods for ensuring model integrity. The algorithms were evaluated on metrics such as accuracy, F1-score, recall, precision, and support. The experimental results showed that the Random Forest algorithm performed best, achieving an accuracy of 99.91%, which highlights its potential for detecting malware.

While the study’s results are promising, it is important to recognize that optimal hyperparameters may vary depending on the dataset and algorithm used. Therefore, customizing hyperparameters for each specific problem is recommended.
Additionally, the research stressed the critical importance of continuous evaluation against new, unseen files, as high accuracy alone does not ensure comprehensive detection and prevention of malicious software.

References

1. Radwan, A.M.: Machine learning techniques to detect maliciousness of portable executable files. In: Proceedings - 2019 International Conference on Promising Electronic Technologies, ICPET 2019 (2019). https://doi.org/10.1109/ICPET.2019.00023
2. Maleh, Y.: Malware Classification and Analysis Using Convolutional and Recurrent Neural Network (2019). https://doi.org/10.4018/978-1-5225-7862-8.ch014
3. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Model. 406 (2019). https://doi.org/10.1016/j.ecolmodel.2019.06.002
4. Mallik, A., Khetarpal, A., Kumar, S.: ConRec: malware classification using convolutional recurrence. J. Comput. Virol. Hack. Tech. 18(4) (2022). https://doi.org/10.1007/s11416-022-00416-3
5. Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware classification with recurrent networks. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings (2015). https://doi.org/10.1109/ICASSP.2015.7178304
6. Chen, Y.H., Lin, S.C., Huang, S.C., Lei, C.L., Huang, C.Y.: Guided malware sample analysis based on graph neural networks. IEEE Trans. Inf. Forensics Secur. 18 (2023). https://doi.org/10.1109/TIFS.2023.3283913
7. Ucci, D., Aniello, L., Baldoni, R.: Survey of machine learning techniques for malware analysis. Comput. Secur. 81 (2019). https://doi.org/10.1016/j.cose.2018.11.001
8. Chouchane, M.R., Lakhotia, A.: Using engine signature to detect metamorphic malware. In: Proceedings of the 4th ACM Workshop on Recurring Malcode, WORM 2006. Co-Located with the 13th ACM Conference on Computer and Communications Security, CCS 2006 (2006). https://doi.org/10.1145/1179542.1179558
9. Khodamoradi, P., Fazlali, M., Mardukhi, F., Nosrati, M.: Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms. In: 18th CSI International Symposium on Computer Architecture and Digital Systems, CADS 2015 (2016). https://doi.org/10.1109/CADS.2015.7377792
10. Kapratwar, A., Di Troia, F., Stamp, M.: Static and dynamic analysis of android malware. In: ICISSP 2017 - Proceedings of the 3rd International Conference on Information Systems Security and Privacy (2017). https://doi.org/10.5220/0006256706530662
11. Li, J., Sun, L., Yan, Q., Li, Z., Srisa-An, W., Ye, H.: Significant permission identification for machine-learning-based android malware detection. IEEE Trans. Ind. Inform. 14(7) (2018). https://doi.org/10.1109/TII.2017.2789219
12. Singh, P., Borgohain, S.K., Kumar, J.: Performance enhancement of SVM-based ml malware detection model using data preprocessing. In: 2022 2nd International Conference on Emerging Frontiers in Electrical and Electronic Technologies, ICEFEET 2022 (2022). https://doi.org/10.1109/ICEFEET51821.2022.9848192
13. Yang, L., Liu, J.: TuningMalconv: malware detection with not just raw bytes. IEEE Access 8 (2020). https://doi.org/10.1109/ACCESS.2020.3014245
14. Karampudi, B., Phanideep, D.M., Reddy, V.M.K., Subhashini, N., Muthulakshmi, S.: Malware analysis using machine learning. In: Abraham, A., Pllana, S., Casalino, G., Ma, K., Bajaj, A. (eds.) ISDA 2022. LNNS, vol. 716, pp. 281–290. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-35501-1_28
15. Jin, B., Choi, J., Hong, J.B., Kim, H.: On the effectiveness of perturbations in generating evasive malware variants. IEEE Access 11 (2023). https://doi.org/10.1109/ACCESS.2023.3262265
16. Zhu, S., Zhang, Z., Yang, L., Song, L., Wang, G.: Benchmarking label dynamics of VirusTotal engines. In: Proceedings of the ACM Conference on Computer and Communications Security (2020). https://doi.org/10.1145/3372297.3420013
17. Maulat Nasri, N.N.: Android malware detection system using machine learning. Int. J. Adv. Trends Comput. Sci. Eng. 9(1.5) (2020). https://doi.org/10.30534/ijatcse/2020/4691.52020
18. Akhtar, M.S., Feng, T.: Malware analysis and detection using machine learning algorithms. Symmetry (Basel) 14(11) (2022). https://doi.org/10.3390/sym14112304
19. Jacobs, I.S., Bean, C.P.: Fine particles, thin films and exchange anisotropy (effects of finite dimensions and interfaces on the basic properties of ferromagnets). In: Spin Arrangements and Crystal Structure, Domains, and Micromagnetics (1963). https://doi.org/10.1016/b978-0-12-575303-6.50013-0
20. Carlin, D., O’Kane, P., Sezer, S.: A cost analysis of machine learning using dynamic runtime opcodes for malware detection. Comput. Secur. 85 (2019). https://doi.org/10.1016/j.cose.2019.04.018
21. On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Philos. Trans. Roy. Soc. Lond. Ser. A Math. Phys. Sci. 247(935) (1955). https://doi.org/10.1098/rsta.1955.0005
22. Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction, selection and fusion for effective malware family classification. In: CODASPY 2016 - Proceedings of the 6th ACM Conference on Data and Application Security and Privacy (2016). https://doi.org/10.1145/2857705.2857713
23. Kramer, O.: Scikit-Learn. In: Studies in Big Data, vol. 20 (2016). https://doi.org/10.1007/978-3-319-33383-0_5