Philadelphia University
FACULTY OF IT
Project Title:
Improving Snort Techniques
Using Machine Learning
Supervisor:
Dr. Athari Alnatshah
Group Members:
Yazan Majdi Arman 202211016
Omar Saleem Al-Bishtawi 202210748
Mohammad Yousef Abu Jneineh 202110500
Abdulrahman Asaad Jalamneh 202210980
1st
semester
2025-2026
Approval
We certify that we have read this project and that, as members of the project evaluation
committee, we have examined the students on the content of this document and the knowledge
related to it. We certify that it is adequate in standing as a project in partial fulfilment
of the requirements of the Cyber-Security department.
Certificate
It is certified that this project has been prepared and written under my direct supervision and
guidance. I also would like to certify that this document is approved for submission and
evaluation.
Supervisor:
Signature:
Date:
Dedication
We dedicate this work to our parents and families. Thank you for being our constant source of
strength and for providing the encouragement needed to turn our aspirations into reality. This
achievement is as much yours as it is ours.
Acknowledgment
First and foremost, we would like to express our deepest gratitude to our supervisor, Dr.
Athari Alnatshah, for her invaluable guidance, patience, and expert advice throughout the
development of this project.
We extend our special thanks to our friends for their unwavering support and motivation.
Their contributions were essential in shaping the functionality of this system.
Finally, we thank everyone who contributed, directly or indirectly, to the successful completion
of this work.
Table of Contents
Chapter 1 - Introduction
1.1 Introduction
1.2 Background
1.3 Problem Statement
1.4 Limitations
1.5 Project Objectives
1.6 Project Solution Overview
Chapter 2 – Literature Review
2.1 Introduction
2.2 Review of existing systems
2.3 Comparison of existing solutions
2.4 Evaluation of current states
Chapter 3 – Methodology & Plan
3.1 Introduction
3.3 Requirement Gathering Techniques (Waterfall)
3.4 Project Plan (Gantt Chart)
3.5 Development Tools and Technologies
Chapter 4 – System Specification
4.1 Introduction
4.2 Functional Requirements
4.3 Logical system
Chapter 5 – Implementation & Testing
5.1 Introduction
5.2 Testing phase
Chapter 6 – Future work & Conclusion
6.1 Introduction
6.2 Future Work
6.3 Conclusion
References
Academic Journals & Papers
Company Technology & Software
Methodology
List of Tables
Table 1: Tasks Chart
Table 2: Examines the evolution of Network Intrusion Detection Systems
Table 3: Comparative Analysis of Techniques
Table 4: Comparison of existing solutions
Abstract
Sophisticated network cyberattacks continue to pose a significant and
costly risk, highlighting the inherent limitations of traditional signature-based
security systems that are purely reactive and often result in high false positive
rates. This project addresses the critical need for an accessible,
resource-efficient, AI-enhanced network intrusion identification system capable of
effectively distinguishing abnormal network behaviour from benign traffic. The
methodology involved utilizing a robust ensemble machine learning approach,
concentrating on the analysis of behavioural network flow features to establish a
reliable detection capability. Crucially, the model was hardened using specialized
techniques to correct for the challenges posed by uneven data distributions,
ensuring effective identification across all threat categories. The resulting
lightweight solution successfully demonstrates high detection reliability while
significantly minimizing false positive alarms, offering a practical and
economically sound approach to proactive security enhancement for
organizations that lack high-end computational infrastructure.
Chapter 1 - Introduction
1.1 Introduction:
In today's connected digital landscape, cyberattacks pose a serious
threat to organizations, corporations, governments, and individuals alike.
Among the various forms of cyber threats, network-based attacks are some of
the most prevalent and dangerous, as they target the fundamental
infrastructure that enables communication and data exchange.
To address this growing threat, this project proposes an AI-enhanced
network cyberattack detection system focused specifically on network-level
intrusions. Researchers and security companies around the world, such as
Fortinet, are working toward the same goal of identifying cyberattacks early
enough to prevent them. Recent studies and real-world applications combine AI
models with traditional methods to improve both accuracy and speed when
identifying network intrusions. As cyber threats continue to evolve, the
integration of intelligent systems into network monitoring and defence
strategies becomes not only beneficial but necessary.
1.2 Background:
The motivation behind this project stems from the urgent need to
strengthen cyber security defences in an increasingly connected and
vulnerable digital environment. As organizations continue to expand their digital
operations, the attack surface of their networks grows correspondingly, making
them more susceptible to cyber threats, particularly those that originate or
propagate through network traffic.
This work concentrates on integrating predictive capabilities into
cybersecurity systems to confront these challenges. By utilizing AI, ML, and
deep learning, the project intends to shift away from the passive, post-incident
response approach toward active anticipation of threats. The aim is
to enable early detection of potential attacks, enhanced situational
awareness, and an overall improved security posture for digital infrastructures. In
essence, the wider goal is to turn cybersecurity into a dynamic, predictive
discipline that keeps pace with the evolution of the threat landscape.
1.3 Problem Statement:
Several tools have attempted to shift from traditional detection methods
to proactive security using artificial intelligence. Some of them leverage
self-learning AI to anticipate abnormal network behaviours and flag potential
breaches before they occur, while others depend on a predictive defence
methodology. However, these systems often suffer from high false positive rates,
where normal activities are mistakenly flagged as threats. This not only wastes
resources but also reduces confidence in the system’s verdicts.
Other solutions use machine learning models trained on historical data
to predict and block malware or network intrusions before execution. These
systems show promise, but they come at a significant cost both financially
and computationally. Their reliance on powerful infrastructure and premium
licensing models makes them inaccessible to many users and small
organizations that would benefit most from predictive security capabilities.
Main problems:
▪ False positive alerts
▪ High maintenance cost
1.4 Limitations:
The scope and final outcomes of this project are subject to the following constraints:
• Data Dependency: The model's training is based on a well-known benchmark
dataset, meaning its performance may require further fine-tuning when applied
to a specific, unique live network environment.
• Tool Constraints: The project implementation relies on readily available
standard open-source programming and machine learning frameworks, which
limits the use of specialized or proprietary technologies.
• Deployment Scope: The project concludes with the creation of the final, ready-
to-use model file. Integration into a full, live network security infrastructure for
continuous operation is excluded from this phase due to time constraints.
1.5 Project Objectives:
The primary objectives this project intends to accomplish are:
• To develop a machine learning model capable of identifying network-level
cyberattacks by analysing network flow data.
• To significantly reduce false positives and ensure a high detection success rate
across different threat types by utilizing methods designed to handle uneven
data distribution.
• To establish an effective threat identification capability by deploying a
lightweight and resource-efficient architecture.
• To create a security solution that is accessible and can operate effectively
without necessitating the purchase of high-end computational infrastructure.
1.6 Project Solution Overview:
The proposed solution involves an AI-enhanced intrusion identification system
based on an effective ensemble learning method. This algorithm was selected due to
its demonstrated robustness when processing complex network traffic data. The
solution focuses on extracting and analysing behavioural network flow features such
as connection metrics and data transfer volumes to distinguish between benign and
malicious activity. The model is specially trained with methods to address common
challenges posed by unbalanced datasets, which is essential to ensure that less
frequent but critical attack types are reliably detected.
1.7 Project Scope:
Project Scope Covered:
• Thorough feature preparation and processing of the network dataset to
maximize the model's performance.
• Selection and rigorous optimization of an optimal classification algorithm.
• Implementation of techniques to handle the challenges of skewed data
distribution across different threat categories.
• Validation and comparative testing of the final model's performance using
standard, appropriate security metrics.
• Preparation of the final, fully optimized model and data handling pipeline for
future deployment.
Project Scope Not Covered:
• Creation of a comprehensive Graphical User Interface (GUI) for operational use
by an end-user.
• Development of a complex live API for real-time packet capture and integration
with commercial network security tools.
• Detailed exploration or comparison of complex deep learning architectures, which
contradict the project's goal of a lightweight design.
1.8 Project Feasibility:
The project has been assessed and confirmed to be practical and achievable across
all key areas:
• Technical Feasibility: The core technology (Machine Learning, data
processing, and programming frameworks) is well-established and accessible,
and the initial algorithmic testing confirms the technical viability of the
approach.
• Operational Feasibility: The goal is to produce a resource-light model that
analyses standard network flow data. This design makes the solution
compatible and practical for integration into existing network monitoring
processes.
• Economic Feasibility: By relying exclusively on open-source software and
selecting a resource-efficient model architecture, the project minimizes
development and potential deployment costs, ensuring it remains an
economically sound alternative to expensive commercial systems.
Table 1: Tasks Chart
(Time schedule chart: periods 0 to 7 on the horizontal axis; one row each for Yazan, Omar, Abdulrahman, and Mohammad; task categories: Dataset, Tools, AI-Algo, ML)
Chapter 2 – Literature Review
2.1 Introduction
In the context of this project, several key terms are defined to establish clarity.
Cyberattack detection refers to the use of data-driven methods, particularly artificial
intelligence, to anticipate and forecast potential malicious activities on a network
before they occur. Machine Learning (ML) is a branch of artificial intelligence that
enables systems to learn patterns from historical data and make decisions without
being explicitly programmed. A network flow represents a sequence of packets
sharing common properties such as source and destination IP addresses, ports,
and protocol types, which are crucial features used for model training. Feature
extraction involves selecting relevant attributes from raw network data that are
most useful for identifying potential threats. Classification algorithms are
supervised ML models such as Decision Trees, Random Forests, or Support
Vector Machines that categorize data into labels like “attack” or “normal.” The term
dataset refers to a structured collection of historical network records used to train
and evaluate the model. Real-time implies the system’s ability to analyse and
respond to incoming traffic almost instantly, enhancing proactive defence. Lastly,
false positives and false negatives represent incorrect alerts where the model
either wrongly flags normal activity as malicious or fails to detect an actual attack,
respectively. Understanding these terms is essential for interpreting the
methodology and results of this study.
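The definition of a network flow given above can be illustrated with a minimal sketch. The packet records, field order, and the `group_into_flows` helper are hypothetical and exist only for illustration; real flow exporters also apply timeouts and direction handling, which are omitted here.

```python
from collections import defaultdict

# Hypothetical packet records:
# (src_ip, src_port, dst_ip, dst_port, protocol, size_bytes)
packets = [
    ("10.0.0.5", 51000, "192.168.1.10", 80, "tcp", 60),
    ("10.0.0.5", 51000, "192.168.1.10", 80, "tcp", 1500),
    ("10.0.0.7", 53000, "192.168.1.53", 53, "udp", 74),
    ("10.0.0.5", 51000, "192.168.1.10", 80, "tcp", 400),
]


def group_into_flows(packets):
    """Group packets that share the same 5-tuple into one flow record."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src_ip, src_port, dst_ip, dst_port, proto, size in packets:
        key = (src_ip, src_port, dst_ip, dst_port, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


flows = group_into_flows(packets)
for key, stats in flows.items():
    print(key, stats)
```

Per-flow counters such as the packet and byte totals computed here are the kind of behavioural features a classifier can be trained on.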
2.2 Review of existing systems:
This literature review examines the evolution of Network Intrusion Detection Systems (NIDS)
from 2011 to 2019, highlighting a shift from foundational data preprocessing and taxonomic
strategies toward high-performance frameworks like Hadoop and Cloud-based environments.
The research emphasizes the integration of machine learning, specifically Random Forest
methods, and the critical role of data collection, evidenced by the categorization of 34 public
datasets to improve detection accuracy and scalability, as shown in Figure 2.1:
Table 2 Examines the evolution of Network Intrusion Detection Systems
Author & Year | Title | Focus | Topics
Shoham et al. (2023) | Hybrid ddos detection | Hybrid DDoS Detection: A hybrid machine learning approach specifically designed. | Detect and mitigate Distributed Denial-of-Service (DDoS) attacks.
Davis and Clark (2011) | Data preprocessing for anomaly-based network intrusion detection: A review | Data preprocessing | Relevant features construction using targeted content parsing and deeper network packet inspection
Jeong et al. (2012) | Anomaly tele-traffic intrusion detection systems on Hadoop-based Platforms: A survey of problems and solutions | Framework | Hadoop and big data platforms for speed, storage volume, and cost-efficiency
Poston (2012) | A brief taxonomy of intrusion detection strategies | Strategies | Taxonomy of traditional network intrusion detection
Modi et al. (2013) | A survey of intrusion detection techniques in Cloud | Framework | Incorporating IDS on host system and virtual machines
Keegan et al. (2016) | A survey of cloud-based network intrusion detection analysis | Framework | Integrating machine learning algorithms and MapReduce to cloud computing environments
(Figure 2.1 review examines the evolution of Network Intrusion Detection Systems)
A comparative analysis of the various techniques adopts a multi-fold approach, where
the techniques are categorized based on their distinct characteristics and then
compared to identify the advantages and disadvantages of each. The
categories are based on (1) the learning mechanism employed for
classification and detection, (2) the features used for training and detection, (3)
the AI techniques employed, and (4) the deployment, as shown in Figure 2.2:
Table 3 Comparative Analysis of Techniques
(Figure 2.2 COMPARATIVE ANALYSIS OF THE VARIOUS TECHNIQUES ADOPTS A MULTI-FOLD APPROACH)
2.3 Comparison of existing solutions:
Before presenting the comparison, it’s essential to note that the analysed tools
employ different approaches to cyberattack prediction. BforeAI PreCrime relies
on predictive modelling and domain behaviour scoring to forecast malicious
domains before they’re used. SafeBreach simulates attack scenarios using
breach and attack emulation techniques to assess defensive gaps. ThetaRay
applies deep learning and statistical anomaly detection over large-scale data
streams to uncover hidden threats. Darktrace uses self-learning AI and
clustering to identify abnormal behaviours in real time. Comparing these
technologies highlights how current solutions integrate AI in various forms
(predictive analytics, simulation, deep anomaly detection, and self-learning) to
achieve proactive cyber defence, as shown in Figure 2.3:
Table 4: Comparison of existing solutions
Tool Name | Timeframe | Datasets Used | Technique Used | Accuracy | Disadvantages
BforeAI PreCrime | Up to 89 days ahead | Domain metadata, behavioural patterns | Predictive modelling, behavioural scoring | 50% for domain-level prediction | Can’t detect zero-day attacks
SafeBreach | Hours to days ahead | Simulated attack methods | Attack simulation, scenario modelling | Not publicly disclosed | Does not predict external threats & relies on simulated playbooks
ThetaRay | Real-time (Hours ahead) | Big data logs | Deep anomaly detection (statistical + DL) | Not publicly disclosed | Not ideal for IT-centric threats; specialized to financial data flows
Darktrace Antigena | Minutes to hours ahead | Proprietary traffic & self-learned organizational behaviour | Self-learning AI, clustering, anomaly detection | Not publicly disclosed | Can generate false positives; lacks transparency into detection rationale
(Figure 2.3 Shows a comparison of existing solutions)
2.4 Evaluation of current states:
The current state of network-based cyberattack detection systems has evolved
considerably with the integration of Artificial Intelligence (AI) and Machine
Learning (ML). Traditional rule-based Intrusion Detection Systems (IDS) are still
widely deployed; however, they struggle to predict sophisticated and previously
unknown attack patterns due to their reliance on static signatures and
predefined rules. This limitation has catalysed a shift toward more adaptive and
intelligent systems.
A significant issue in the current state of research is the over-reliance on
synthetic or outdated datasets, which may not reflect real-world traffic. This can
result in high accuracy during testing but poor generalization to live networks.
Additionally, class imbalance (far more benign, non-malicious traffic than
attack traffic) and a lack of diversity in attack types can bias the models.
Furthermore, many modern AI models, especially deep learning-based
ones, act as "black boxes." While their predictions may be accurate,
the reason a certain traffic flow was classified as malicious is often
unclear. This poses a problem in critical environments where interpretability and
trust in automated decisions are necessary. As a result, AI explanation
libraries could play a significant role in the outcomes of this project.
Another issue that drew our attention is the vulnerability of AI-based
cybersecurity systems to adversarial machine learning.
Attackers can exploit weaknesses in the model by crafting malicious traffic
designed to evade detection, known as evasion attacks, much like the fake
antivirus programs that trended recently. These may involve subtle
manipulations in network flow characteristics such as altering packet sizes,
interarrival times, or flow counts to bypass the model without triggering alerts.
Such attacks are particularly dangerous because they often go unnoticed while
still accomplishing malicious goals.
An additional major threat comes in the form of poisoning attacks, where
attackers inject misleading or crafted data into the training pipeline. This
compromises the model’s learning process, causing it to make incorrect
predictions during deployment. In high-stakes environments like banking or critical
infrastructure, even a small decrease in model performance due to data
poisoning can lead to significant security breaches. Poisoned models may even
"learn" to ignore specific attack patterns entirely, leaving systems defenceless
against known threats.
Moreover, model inversion attacks present a privacy risk by allowing
adversaries to infer sensitive training data or internal logic of the model based
on its outputs. In network contexts, this could reveal user behaviour, device
types, or even traffic patterns related to secure internal services. These attacks
highlight that AI not only protects but also exposes new attack surfaces that
must be secured.
To defend against these vulnerabilities, AI-based systems must
incorporate robust training methods, such as adversarial training, encryption,
anomaly injection resistance, and continuous validation using fresh traffic.
Additionally, deploying these models in secure environments with strong access
controls and logging can reduce the risk of exploitation.
Chapter 3 – Methodology & Plan
3.1 Introduction
This chapter outlines the systematic approach used to design and implement
the Intelligent Model for Predicting Network Cyber-Attacks. The methodology serves
as the backbone of the project, ensuring that data is handled correctly, models are
trained rigorously, and the final system provides reliable, real-time security
insights. By combining data science principles with network security protocols, this
methodology aims to achieve high detection accuracy with minimal false alarms.
3.2 System Development Methodology
The project follows the Waterfall Methodology, a linear and sequential software
development life cycle. Given the critical security requirements of an Intrusion
Detection System (IDS), this model was chosen to ensure that each stage from data
acquisition to system deployment reaches a state of total completion and validation
before the next phase commences.
The Waterfall approach provides a structured environment where requirements are
frozen after the initial phase. This stability was essential for the complex integration of
network monitoring tools (Snort/Zeek) with the Python-based AI backend, ensuring
that the data flow remained consistent and reliable. Security systems require clear
documentation and stable requirements.
By using Waterfall, we ensured that the dataset selection was finalized and
validated before we began training the Random Forest model. This minimized errors
in the integration phase.
3.3 Requirement Gathering Techniques (Waterfall)
To establish a robust system specification, multiple information-gathering
strategies were utilized:
• Document Analysis: An exhaustive review of the UNSW-NB15 dataset technical
whitepapers was conducted to understand feature engineering requirements.
Additionally, the official documentation for Snort and Zeek was analysed to map
log formats to AI input variables.
• Brainstorming & Expert Consultation: Technical sessions were held to define the
"Intelligence Suite" logic, specifically focusing on how to bridge the gap between
signature-based alerts and probabilistic AI behaviour.
• Prototyping: Small-scale tests were performed to determine the hardware and
software limitations of running a Flask web server alongside real-time network
sniffers.
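The mapping of log formats to AI input variables mentioned under Document Analysis might look like the following sketch. The `ZEEK_TO_FEATURE` table is an assumption made for illustration: it pairs Zeek `conn.log` field names with UNSW-NB15-style feature names that appear to be the closest match, and it is not taken from the project's code. The sample lines use a reduced, hypothetical column set.

```python
# Assumed (illustrative) mapping from Zeek conn.log fields to
# UNSW-NB15-style feature names; the correspondence is approximate.
ZEEK_TO_FEATURE = {
    "duration": "dur",
    "proto": "proto",
    "service": "service",
    "orig_pkts": "spkts",
    "resp_pkts": "dpkts",
    "orig_bytes": "sbytes",
    "resp_bytes": "dbytes",
}
NUMERIC = {"dur", "spkts", "dpkts", "sbytes", "dbytes"}


def parse_conn_log(lines):
    """Parse Zeek TSV conn.log lines into feature dictionaries."""
    fields, rows = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # column names follow the tag
            continue
        if not line or line.startswith("#"):
            continue  # skip the remaining header and footer lines
        record = dict(zip(fields, line.split("\t")))
        features = {}
        for zeek_name, feature in ZEEK_TO_FEATURE.items():
            value = record.get(zeek_name, "-")
            if value == "-":  # Zeek writes "-" for unset fields
                features[feature] = None
            elif feature in NUMERIC:
                features[feature] = float(value)
            else:
                features[feature] = value
        rows.append(features)
    return rows


# Hypothetical log excerpt with a reduced set of columns.
sample = [
    "#fields\tts\tuid\tproto\tservice\tduration\torig_bytes\tresp_bytes\torig_pkts\tresp_pkts",
    "1700000000.0\tCabc123\ttcp\thttp\t0.25\t512\t2048\t6\t5",
    "1700000001.0\tCdef456\tudp\t-\t-\t76\t0\t1\t0",
]
rows = parse_conn_log(sample)
print(rows)
```

Unset values are returned as `None` so that a later imputation step can decide how to fill them.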
(Figure 3.1 Waterfall methodology)
3.4 Project Plan (Gantt Chart)
The project followed a strict timeline to ensure that all components, from data
pre-processing through to the web interface, were completed on schedule, as shown in Figure 3.2:
(Figure 3.2 Gantt Chart tasks timeline)
3.5 Development Tools and Technologies
The project utilizes a modern Python-based stack designed for high-performance
machine learning and data processing.
Data Pre-processing:
• Programming Language (Python): Chosen for its extensive ecosystem of security
and AI libraries.
• Scikit-Learn (Library): Used to build the "Fail-Safe" Pipeline, including the
ColumnTransformer and VotingClassifier.
• Joblib (Persistence): Used to produce the .pkl (pickle) file, allowing the model to be
saved and loaded into a real-time environment without retraining.
• Pandas & NumPy: Essential for handling the large-scale data structures of network
traffic logs.
• Random Forest & PCA: Random Forest serves as the core intelligence layer,
utilizing an ensemble of decision trees to classify network traffic through a majority
voting mechanism. PCA was implemented to reduce the high
dimensionality of the UNSW-NB15 dataset by projecting it into a lower-dimensional
space while retaining 95% of the variance.
• Matplotlib & Seaborn: Used to generate the "Intelligence Performance Summary"
and "Model Percentages" charts.
• Google Colab: The primary development environment for writing and testing the
dataset pre-processing.
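The roles of the Voting Classifier and Joblib described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's pipeline: the choice of soft voting, the two member models, the hyperparameters, and the file name `ids_model.pkl` are all assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the project itself uses UNSW-NB15 flow features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Soft voting averages the class probabilities of both models
# (an assumed setting for this sketch).
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

# Persist the fitted model, then reload it as a deployment step would.
model_path = os.path.join(tempfile.mkdtemp(), "ids_model.pkl")
joblib.dump(ensemble, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same = np.array_equal(ensemble.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```

Because pickle files can execute code when loaded, a saved model should only be loaded from a trusted location.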
Implementation:
• Operating System (Linux Ubuntu): The main environment used for integrating
the model, providing a stable platform for running concurrent security services.
• Network Analyzer (Zeek): Used for understanding traffic behaviour and generating
high-level metadata logs. Zeek transforms raw packets into structured data that the
AI can interpret.
• Intrusion Detection System (Snort): Implemented for signature-based detection.
Snort applies predefined rules to the traffic stream to catch known malicious
patterns instantly.
• Integrated Development Environment (PyCharm): The central IDE used to write
the integration code, managing the complex logic required to merge Zeek/Snort
outputs into a single Python application.
• Web Framework (Flask): Used to develop the final dashboard. Flask acts as the
"Presentation Layer," pulling data from the AI model to show real-time "Normal vs.
Attack" percentages on the user's SIEM dashboard.
Chapter 4 – System Specification
4.1 Introduction
This chapter describes the UNSW-NB15 network intrusion dataset used in the
project. The following sections explain the methodology used to extract features from
the dataset and the step-by-step workflow for data preparation. It details the process
of training the system using Random Forest and Support Vector Machine (SVM)
algorithms based on behavioural analysis. After the dataset is processed and verified,
the findings from the AI's feature importance analysis are used to generate manual
Snort rules. This hybrid approach allows the system to identify both known signatures
and abnormal behaviour. Figure 4.1 illustrates the general stages of the
methodology, from data pre-processing to final performance evaluation.
(Figure 4.1 System chart)
4.2 Functional Requirements
Dataset Selection: The dataset used in this project is the UNSW-NB15, a
modern network intrusion dataset. Unlike older datasets (like KDD99), UNSW-NB15
contains a comprehensive variety of contemporary synthesized attack activities and
normal traffic. The dataset consists of over 250,000 samples categorized into nine
attack types (including Fuzzers, Exploits, DoS, and Generic) and Normal traffic. It was obtained
in CSV format, containing 45 features including duration, protocol type, and packet
counts.
Data Preprocessing: In this stage, the raw network logs are cleaned and transformed.
Redundant data is removed, and categorical strings (like "TCP" or "UDP") are
converted into numerical values that the AI can understand.
I. Cleaning: To ensure the model does not crash, an imputation strategy was
used. Any missing values (NaNs) or undefined records were handled using
SimpleImputer. Numerical gaps were filled with the median value, and
categorical gaps were filled with an "unknown" placeholder.
II. Feature Encoding & Normalization: Categorical features were processed using
One-Hot Encoding, and numerical features were scaled using StandardScaler
to ensure that high-value features (like sbytes) did not overshadow smaller
values (like dur).
III. Balancing: The dataset was stratified during the split to ensure that both
"Normal" and "Attack" classes were proportionally represented in both training
and testing phases.
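The three preprocessing steps above can be sketched with scikit-learn as follows. The tiny DataFrame is hypothetical and only borrows UNSW-NB15-style column names; the split ratio and random seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical sample using UNSW-NB15-style column names.
df = pd.DataFrame({
    "dur": [0.1, 0.5, np.nan, 2.0, 0.3, 1.2, 0.05, 0.9],
    "sbytes": [200, 1500, 300, np.nan, 800, 50000, 120, 640],
    "proto": ["tcp", "udp", "tcp", np.nan, "tcp", "udp", "tcp", "tcp"],
    "label": [0, 0, 0, 0, 1, 1, 1, 1],
})
numeric_cols, categorical_cols = ["dur", "sbytes"], ["proto"]

# I. Cleaning and II. Encoding/Normalization, one branch per column type.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# III. Stratified split keeps the Normal/Attack ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df["label"],
    test_size=0.5, stratify=df["label"], random_state=42,
)

# Fit on the training part only, then apply the same transform to the test part.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, y_train.value_counts().to_dict())
```

Fitting the imputers and the scaler on the training split alone avoids leaking test-set statistics into the model.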
Feature Selection (RF + PCA): To achieve maximum efficiency in a live SIEM (Security
Information and Event Management) environment, a two-step reduction process was
used:
1. Random Forest (RF): First, RF was used to determine the "Feature
Importance." This identifies which network attributes (like sttl or sbytes) have
the most significant impact on detecting an attack.
2. Principal Component Analysis (PCA): Following RF, PCA was applied to reduce
dimensionality. We retained 95% of the cumulative variance, which compressed
the wide dataset into a core set of 22 Principal Components. This reduces
processing time without sacrificing the accuracy of the detection engine.
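The two-step reduction described above can be sketched as follows on synthetic data. The sample size, feature count, and hyperparameters are assumptions for illustration, so the number of components printed here will not match the 22 components reported for UNSW-NB15.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed flow features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=6, random_state=42
)

# Step 1: rank features by Random Forest importance.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("Top 5 feature indices:", ranking[:5])

# Step 2: PCA keeping enough components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)
retained = pca.explained_variance_ratio_.sum()
print("Components kept:", pca.n_components_)
print("Variance retained:", round(float(retained), 3))
```

Passing a fraction to `n_components` lets PCA choose the smallest number of components whose cumulative explained variance exceeds that fraction.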
Figure 4.2 shows the most important indicators for detecting
attacks:
(Figure 4.2: Top 10 Network features attack indicators)
The Random Forest model works by asking a series of "Yes/No" questions (Decision
Trees).
• A feature is "Important" if it is the best at splitting the data into two clean
groups (Normal vs. Attack).
• If a feature like sttl (Source TTL) can instantly separate 80% of attacks from
normal traffic, the AI gives it a high "Importance Score."
Mathematically, this is often measured by the Gini Impurity index:
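For reference, the standard definition of Gini impurity for a node whose samples fall into classes $k$ with proportions $p_k$ is

$$G = 1 - \sum_{k} p_k^2$$

A split is scored by how much it lowers the size-weighted impurity of the child nodes relative to the parent. The sketch below evaluates this for a hypothetical node; the labels and the split are invented for illustration and are not drawn from the project's data.

```python
def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of both children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node: 5 normal flows (0) and 5 attack flows (1).
parent = [0] * 5 + [1] * 5
# Hypothetical split on a threshold of one feature (for example sttl).
left = [0, 0, 0, 0, 1]   # mostly normal
right = [0, 1, 1, 1, 1]  # mostly attack

print("Parent impurity:", gini(parent))
print("Impurity decrease:", impurity_decrease(parent, left, right))
```

A feature's importance score in a Random Forest accumulates such impurity decreases over every split that uses the feature, averaged across the trees.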
4.3 Logical system
Random Forest (RF): Random Forest is an ensemble learning method that
builds multiple decision trees and merges them to get a more accurate and stable prediction. In
our model, RF acts as the primary "Heavy Hitter" classifier. It is highly resistant to
overfitting and excels at identifying complex patterns in high-dimensional network
data.
Support Vector Machine (SVM): SVM is a supervised learning model that finds the
optimal hyperplane to separate classes in a multi-dimensional space. It is particularly
effective for binary classification of network traffic where the boundary between
"Normal" and "Attack" is narrow.
Figure (4.3) presents the Receiver Operating Characteristic (ROC) curve of the
proposed AI-based intrusion detection system. The ROC curve illustrates the
relationship between the True Positive Rate (TPR) and the False Positive Rate
(FPR) at various classification thresholds.
The solid curve represents the performance of the Random Forest classifier, while
the diagonal dashed line indicates the performance of a random classifier. The
model achieves an Area Under the Curve (AUC) of 0.990, which demonstrates
excellent discriminative capability between normal and attack traffic.
(Figure 4.3 ROC curve)
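To make the ROC construction concrete, the sketch below sweeps the decision threshold over a list of attack scores, records the resulting (FPR, TPR) pairs, and integrates them with the trapezoidal rule. The function names and the toy scores in the usage note are illustrative assumptions; the code expects binary labels (1 = attack, 0 = normal) with both classes present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With labels [1, 1, 0, 1, 0, 0] and scores [0.9, 0.8, 0.7, 0.6, 0.4, 0.2], eight of the nine attack/normal pairs are ranked correctly, so the area is 8/9. A value near 1, such as the reported 0.990, means an attack flow almost always receives a higher score than a normal one.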
Random Forest Results: After applying PCA, the Random Forest model
demonstrated outstanding performance. It achieved an overall Accuracy of 94.82%.
The model showed high precision, meaning it generated very few false alarms,
making it suitable for a real-time SOC environment.
SVM Results: The SVM model achieved an Accuracy of 93.32%. While slightly lower
than RF, the SVM provided a robust secondary validation, particularly in identifying
"Stealth" attacks that involve low packet counts.
As shown in Figure 4.3:
(Figure 4.3 Percentages)
Figure 4.4 illustrates the integrated workflow of the hybrid detection system,
where live network traffic is captured and simultaneously processed through two
primary channels. On the left, Snort applies rules to identify known malicious
patterns, while on the right, the Machine Learning pipeline (Random Forest/SVM)
performs behavioural analysis using the UNSW-NB15 feature set. The central
integration module consolidates these findings to provide a comprehensive
security verdict, which is then visualized in real time via the administrative
dashboard.
(Figure 4.4 Workflow)
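One possible shape for the central integration module is sketched below. The report does not state how the two channels are combined, so the decision rule, the 0.5 threshold, and the names Verdict and consolidate are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str   # "attack", "suspicious" or "normal"
    source: str  # which channel(s) raised the finding


def consolidate(snort_alert: bool, ml_attack_probability: float,
                threshold: float = 0.5) -> Verdict:
    """Combine the signature channel and the ML channel into one verdict.

    Assumed rule: a Snort match is always treated as an attack; an ML-only
    detection is reported as suspicious so an analyst can review it.
    """
    ml_flag = ml_attack_probability >= threshold
    if snort_alert and ml_flag:
        return Verdict("attack", "snort+ml")
    if snort_alert:
        return Verdict("attack", "snort")
    if ml_flag:
        return Verdict("suspicious", "ml")
    return Verdict("normal", "none")
```

Keeping the source of each finding in the verdict lets the dashboard show whether an event came from a known signature, from behavioural analysis, or from both.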
Chapter 5 – Implementation & Testing
5.1 Introduction
This chapter details the transition of the project, Improving Snort Techniques
Using Machine Learning, from its theoretical and architectural design into a
functional, multi-layered security prototype. The implementation phase focuses on
integrating established network monitoring tools, specifically Snort and Zeek, with
a custom Python-based AI backend. Using the UNSW-NB15 dataset and the
machine learning models specified in the previous chapter (Random Forest and
SVM), this phase demonstrates how raw network traffic is transformed
into actionable security intelligence.
Figure 5.1 shows the libraries imported at the start of the pre-processing phase:
(Figure 5.1 Data Pre-processing)
Figure 5.2 shows the evaluation results obtained after pre-processing, reported as
Accuracy, Recall, Precision, and F1-score.
(Figure 5.2 Results)
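For reference, the four reported metrics can all be derived from the counts of true/false positives and negatives. The sketch below is a minimal, dependency-free version with a hypothetical function name; it treats label 1 as "Attack" by default.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of alerts that are real attacks
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of attacks that were caught
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Precision is the metric tied to false alarms: every false positive lowers it, which is why a high precision is what makes a detector practical in a SOC setting.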
Figure 5.3: The start of the integration between the AI model (intrution_model.pkl),
Snort alerts (snort/alert) and the Zeek network monitoring file (conn.log)
( Figure 5.3 Model integration )
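As an illustration of the Zeek side of this integration, the sketch below reads records from a conn.log written in Zeek's default tab-separated format, where a "#fields" header line names the columns and "-" marks an unset value. The function name is hypothetical and the sample lines in the usage note are invented; mapping the parsed fields to the model's input features is left out because the report does not describe it.

```python
def parse_zeek_log(lines):
    """Yield one dict per record from a Zeek TSV log such as conn.log."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#fields"):
            # Column names follow the "#fields" tag, separated by tabs.
            fields = line.split("\t")[1:]
        elif line.startswith("#"):
            continue  # other metadata lines (#separator, #types, #close, ...)
        elif fields:
            values = line.split("\t")
            yield {name: (None if value == "-" else value)
                   for name, value in zip(fields, values)}
```

In practice the function can be fed an open file handle, for example `with open("conn.log") as handle: records = list(parse_zeek_log(handle))`, and each record's values converted to numbers before being passed to the classifier.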
Figure 5.4: Using the Flask framework, we created an app.py file that serves the
events dashboard on a dedicated port
( Figure 5.4 App Flask )
Figure 5.5: Displaying the Snort rules we have installed based on the attacks best
captured by our model
(Figure 5.5 Snort Rules)
(Figure 5.6 Sign-in User interface)
Figure 5.7: The final GUI events dashboard, which also shows some test
events
(Figure 5.7 Home Dashboard)
Figure 5.8: After an incident, the person in charge can download a Report.pdf
that presents the last 10 attacks with their metadata
(Figure 5.8 Report file last events)
5.2 Testing phase
Figure 5.2.1 shows the launch of a PS script against the secure shell (SSH) port
to test whether the AI model detects a high volume of connections as a potential
DoS/Probing attack
(Figure 5.2.1)
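The kind of behaviour this test exercises, many connections from one source in a short time, can be approximated with a simple sliding-window counter. The sketch below is only an illustration of that idea and is not the project's detection logic, which relies on the trained model; the function name, the 10-second window, and the 20-connection limit are assumptions.

```python
from collections import defaultdict, deque


def flag_bursts(events, window_seconds=10.0, max_connections=20):
    """Return the sources that open more than max_connections within any window.

    events: iterable of (timestamp, source_ip) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # source_ip -> timestamps inside the current window
    flagged = set()
    for timestamp, source in events:
        window = recent[source]
        window.append(timestamp)
        # Drop connections that have fallen out of the time window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_connections:
            flagged.add(source)
    return flagged
```

A source that opens 30 connections in three seconds would be flagged, while one that opens a connection every five seconds would not; a per-source count of this kind is one example of a behavioural feature that separates a connection flood from ordinary traffic.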
Figure 5.2.2 shows how the event log is presented to the user after the incident
(Figure 5.2.2)
Chapter 6 – Future work & Conclusion
6.1 Introduction
This chapter reflects on the results obtained from the development of this
project. It discusses the data collected during the implementation phase,
evaluates the effectiveness of the system through testing, and places the
findings in context. The overall goal is to gain insight into the strengths and
weaknesses of the developed system and to understand the extent to which it
meets its intended objectives.
6.2 Future Work
While this study successfully demonstrated the integration of the AI models with
Snort and Zeek for threat detection, several avenues for future research remain:
• Real-time Deployment and Scalability: Future research could focus on
transitioning the current model from a controlled environment (Google Colab)
to a live, high-speed network environment. Investigating the computational
overhead of deep learning models in real-time "inline" deployments would be
critical for industry adoption.
• Expansion of Datasets: While the UNSW-NB15 dataset provides a robust
foundation, future iterations of this work should incorporate more recent
datasets (such as CSE-CIC-IDS2018) to account for evolving exploits and
modern encrypted traffic patterns.
• Adversarial Machine Learning: A significant area for further study is the
resilience of AI models against adversarial attacks, where attackers
specifically craft traffic to bypass model-based detection.
6.3 Conclusion
This project enhanced traditional intrusion detection by integrating Machine
Learning with Snort to address the limitations of signature-based systems.
Applied to the UNSW-NB15 dataset, the Random Forest and SVM models reached
accuracies of 94.82% and 93.32% respectively, with the Random Forest achieving
an AUC of 0.990. These results indicate that behavioural analysis can keep false
positives low while maintaining high detection accuracy across diverse network
traffic, and they suggest that the hybrid approach provides a more robust defence
mechanism than signature-based detection alone. While the current prototype
performs well, future work could explore deep learning architectures to further
automate threat mitigation. Ultimately, this research helps bridge the gap
between static rule sets and dynamic AI-driven security, offering a more
resilient framework for modern cybersecurity environments.
References: -
Academic Journals & Papers
• Moamin, S. A., Abdulhameed, M. K., Al-Amri, R. M., Radhi, A. D.,
Naser, R. K., & Pheng, L. G. (2025). Artificial intelligence in malware
and network intrusion detection: A comprehensive survey of
techniques, datasets, challenges, and future directions. Babylonian
Journal of Artificial Intelligence, 2025, 77-98.
• Davis, J. J., & Clark, A. J. (2011). Data preprocessing for
anomaly based network intrusion detection: A review. Computers &
Security, 30(6), 353-375.
• Jeong, H. J., Hyun, W., Lim, J., & You, I. (2012). Anomaly teletraffic
intrusion detection systems on Hadoop-based platforms: A survey of
some problems and solutions. In 15th International Conference on
Network-Based Information Systems (pp. 766-770).
• Poston, H. E. (2012). A brief taxonomy of intrusion detection strategies.
In IEEE National Aerospace and Electronics Conference (pp. 255-263).
• Chou, D., & Jiang, M. (2021). A survey on data-driven network
intrusion detection. ACM Computing Surveys (CSUR), 54(9), 1-36.
• Moustafa, N., & Slay, J. (2015, November). UNSW-NB15: A
comprehensive data set for network intrusion detection systems
(UNSW-NB15 network data set). In 2015 Military Communications
and Information Systems Conference (MilCIS) (pp. 1-6). IEEE.
• Shohan, N. J., Tanbhir, G., Elahi, F., Ullah, A., & Sakib, M. N. (2023,
December). Enhancing network security: A hybrid approach for
detection and mitigation of distributed denial-of-service attacks
using machine learning. In International Conference on Advanced
Network Technologies and Intelligent Computing (pp. 81-95). Cham:
Springer Nature Switzerland.
Company Technology & Software
• Bfore.Ai. (n.d.). Predictive domain scoring and pre-emptive threat
intelligence: Technology overview. https://bfore.ai/
• Darktrace. (n.d.). Autonomous response and real-time behavioural
clustering: Cyber AI research. https://www.darktrace.com/
• Google. (n.d.). Google Colab. https://colab.google/
• PythonAnywhere. (n.d.). PythonAnywhere host services.
https://www.pythonanywhere.com/
• SafeBreach. (n.d.). Attack emulation and defensive gap assessment:
Platform features. https://www.safebreach.com/
• ThetaRay. (n.d.). Deep learning and statistical analysis for large-scale
data. https://thetaray.com/
Methodology
• Lucid Software. (n.d.). Waterfall methodology overview.
https://www.lucidchart.com/

final.pdf

  • 1.
    i Philadelphia University FACULTY OFIT Project Title: Improving Snort Techniques Using Machine Learning Supervisor: Dr. Athari Alnatshah Group Members: Yazan Majdi Arman 202211016 Omar Saleem Al-Bishtawi 202210748 Mohammad Yousef Abu Jneineh 202110500 Abdulrahman Asaad Jalamneh 202210980 1st semester 2025-2026
  • 2.
    ii Approval We certify thatwe have read the project, and as a member of project evaluation committee we had examined the students in the content of this document and knowledge related to it, and we certify that it is adequate with standings as a project for partial fulfilment of the requirements of Cyber-Security department.
  • 3.
    iii Certificate It is certifiedthat this project has been prepared and written under my direct supervision and guidance. I also would like to certify that this document is approved for submission and evaluation. Supervisor: Signature: Date:
  • 4.
    iv Dedication We dedicate thiswork to our parents and families. Thank you for being our constant source of strength and for providing the encouragement needed to turn our aspirations into reality. This achievement is as much yours as it is ours.
  • 5.
    v Acknowledgment First and foremost,we would like to express our deepest gratitude to our supervisor, Dr. Athari Alnatshah, for her invaluable guidance, patience, and expert advice throughout the development of this project. We extend our special thanks to our friends for their unwavering support and motivation. Furthermore, your contributions were essential in shaping the functionality of this system. Finally, we thank everyone who contributed, directly or indirectly, to the successful completion of this work.
  • 6.
    vi Table of Contents Chapter1 - Introduction......................................................................................................... 9 1.1 Introduction:................................................................................................................. 9 1.2 Background: ................................................................................................................. 9 1.3 Problem Statement:...................................................................................................... 9 1.4 Limitations: ................................................................................................................ 10 1.5 Project Objectives:...................................................................................................... 10 1.6 Project Solution Overview: .......................................................................................... 10 Chapter 2 – Literature Review............................................................................................... 13 2.1 Introduction................................................................................................................ 13 2.2 Review of existing systems: ......................................................................................... 13 2.3 Comparison of existing solutions:................................................................................ 15 2.4 Evaluation of current states:........................................................................................ 15 Chapter 3 – Methodology & Plan ........................................................................................... 17 3.1 Introduction................................................................................................................ 17 3.3 Requirement Gathering Techniques (Waterfall)............................................................ 
17 3.4 Project Plan (Gantt Chart) ........................................................................................... 18 3.5 Development Tools and Technologies.......................................................................... 18 Chapter 4 – System Specification ......................................................................................... 20 4.1 Introduction................................................................................................................ 20 4.2 Functional Requirements............................................................................................ 21 4.3 Logical system............................................................................................................ 23 Chapter 5 – Implementation & Testing .................................................................................. 26 5.1 Introduction................................................................................................................ 26 5.2 Testing phase ............................................................................................................. 30
  • 7.
    vii Chapter 6 –Future work & Conclution................................................................................... 32 6.1 Introduction................................................................................................................ 32 6.2 Future Work................................................................................................................ 32 6.3 Conclusion................................................................................................................. 32 References: - ....................................................................................................................... 33 Academic Journals & Papers............................................................................................. 33 Company Technology & Software...................................................................................... 34 Methodology .................................................................................................................... 34 List of Tables Table 1: Tasks Chart ............................................................................................................. 12 Table 2 Examines the evolution of Network Intrusion Detection Systems................................ 13 Table 3 Comparative Analysis of Techniques ......................................................................... 14 Table 4: Comparison of existing solutions ............................................................................. 15
  • 8.
    8 Abstract Sophisticated network cyberattackscontinue to pose a significant and costly risk, highlighting the inherent limitations of traditional signature-based security systems that are purely reactive and often result in high false positive rates. This project addresses the critical need for an accessible and resource efficient AI-enhanced network intrusion identification system capable of effectively distinguishing abnormal network behaviour from benign traffic. The methodology involved utilizing a robust ensemble machine learning approach, concentrating on the analysis of behavioural network flow features to establish a reliable detection capability. Crucially, the model was hardened using specialized techniques to correct for the challenges posed by uneven data distributions, ensuring effective identification across all threat categories. The resulting lightweight solution successfully demonstrates high detection reliability while significantly minimizing false positive alarms, offering a practical and economically sound approach to proactive security enhancement for organizations that lack high-end computational infrastructure.
  • 9.
    9 Chapter 1 -Introduction 1.1 Introduction: In these days connected digital landscape, cyberattacks pose a serious threat to organizations, corporations, governments, and individuals alike. Among the various forms of cyber threats, network-based attacks are some of the most prevalent and dangerous, as they target the fundamental infrastructure that enables communication and data exchange. To address this growing threat, this project intends an AI-enhanced network cyberattack system focused specifically on network-level intrusions. Researchers and many security companies around the world like Fortinet are trying to reach that for cyberattacks which could prevent it. Recent studies and real-world applications combined between AI models and traditional methods in both accuracy and speed when it comes to identifying network intrusions. As cyber threats continue to evolve, the integration of intelligent systems into network monitoring and defence strategies becomes not only beneficial but necessary. 1.2 Background: The motivation behind this project falls under the urgent need to strengthen cyber security defences in an increasingly connected and vulnerable digital environment. As organizations continue to expand their digital operations, the attack surface of their networks grows correspondingly, making them more susceptible to cyber threats, particularly those that originate or propagate through network traffic. This work concentrates on the integration of predictive powers into cybersecurity systems to confront these challenges. By utilizing AI, ML and Deep learning, the project intends to shift away from the passive, post-incident response approach to an active form of threat intelligence anticipation. This is to enable the early detection of potential attacks, enhanced situational awareness, and overall improved security posture of digital infrastructures. 
In essence, it is the wider goal of turning cybersecurity into a dynamic, predictive discipline in tandem with the evolution of the threat landscape. 1.3 Problem Statement: Several tools have attempted to shift from traditional detection methods to proactive security using artificial intelligence. Some of them leverage self- learning AI to anticipate abnormal network behaviours and flag potential breaches before they occur, while other depends on the predictive defence methodology, these systems often suffer from high false positive rates, where normal activities are mistakenly flagged as threats. This not only wastes resources but also reduces confidence in the system’s calculations. Other solutions use machine learning models trained on historical data to predict and block malware or network intrusions before execution. These systems show promise, but they come at a significant cost both financially
  • 10.
    10 and computationally. Theirreliance on powerful infrastructure and premium licensing models makes them inaccessible to many users and small organizations that would benefit most from predictive security capabilities. Main problems: - ▪ False positives alerts ▪ High maintenance cost 1.4 Limitations: The scope and final outcomes of this project are subject to the following constraints: • Data Dependency: The model's training is based on a well-known benchmark dataset, meaning its performance may require further fine-tuning when applied to a specific, unique live network environment. • Tool Constraints: The project implementation relies on readily available standard open-source programming and machine learning frameworks, which limits the use of specialized or proprietary technologies. • Deployment Scope: The project concludes with the creation of the final, ready- to-use model file. Integration into a full, live network security infrastructure for continuous operation is excluded from this phase due to time constraints. 1.5 Project Objectives: The primary objectives this project intends to accomplish are: To develop a machine learning model capable of identifying network-level cyberattacks by analysing network flow data. To significantly reduce false positives and ensure a high detection success rate across different threat types by utilizing methods designed to handle uneven data distribution. To establish an effective threat identification capability by deploying a lightweight and resource-efficient architecture. To create a security solution that is accessible and can operate effectively without necessitating the purchase of high-end computational infrastructure. 1.6 Project Solution Overview: The proposed solution involves an AI-enhanced intrusion identification system based on an effective ensemble learning method. This algorithm was selected due to its demonstrated robustness when processing complex network traffic data. 
The solution focuses on extracting and analysing behavioural network flow features such as connection metrics and data transfer volumes to distinguish between benign and malicious activity. The model is specially trained with methods to address common challenges posed by unbalanced datasets, which is essential to ensure that less frequent but critical attack types are reliably detected.
  • 11.
    11 1.7 Project Scope: ProjectScope Covered: Thorough feature preparation and processing of the network dataset to maximize the model's performance. Selection and rigorous optimization of an optimal classification algorithm. Implementation of techniques to handle the challenges of skewed data distribution across different threat categories. Validation and comparative testing of the final model's performance using standard, appropriate security metrics. Preparation of the final, fully optimized model and data handling pipeline for future deployment. Project Scope Not Covered: Creation of a comprehensive Graphical User Interface (GUI) for operational use by an end-user. Development of a complex live API for real-time packet capture and integration with commercial network security tools. Detailed exploration or comparison of complex deep learning architectures, which contradict the project's goal of a lightweight design. 1.8 Project Feasibility: The project has been assessed and confirmed to be practical and achievable across all key areas: Technical Feasibility: The core technology (Machine Learning, data processing, and programming frameworks) is well-established and accessible, and the initial algorithmic testing confirms the technical viability of the approach. Operational Feasibility: The goal is to produce a resource-light model that analyses standard network flow data. This design makes the solution compatible and practical for integration into existing network monitoring processes. Economic Feasibility: By relying exclusively on open-source software and selecting a resource-efficient model architecture, the project minimizes development and potential deployment costs, ensuring it remains an economically sound alternative to expensive commercial systems.
  • 12.
    12 Table 1: TasksChart 0 1 2 3 4 5 6 7 Yazan Omar Abdulrhman Mohammad TIME SCHEDULE Dataset Tools AI-Algo ML
  • 13.
    13 Chapter 2 –Literature Review 2.1 Introduction In the context of this project, several key terms are defined to establish clarity. Cyberattack detection refers to the use of data-driven methods, particularly artificial intelligence, to anticipate and forecast potential malicious activities on a network before they occur. Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn patterns from historical data and make decisions without being explicitly programmed. A network flow represents a sequence of packets sharing common properties such as source and destination IP addresses, ports, and protocol types, which are crucial features used for model training. Feature extraction involves selecting relevant attributes from raw network data that are most useful for identifying potential threats. Classification algorithms are supervised ML models such as Decision Trees, Random Forests, or Support Vector Machines that categorize data into labels like “attack” or “normal.” The term dataset refers to a structured collection of historical network records used to train and evaluate the model. Real-time implies the system’s ability to analyse and respond to incoming traffic almost instantly, enhancing proactive defence. Lastly, false positives and false negatives represent incorrect alerts where the model either wrongly flags normal activity as malicious or fails to detect an actual attack, respectively. Understanding these terms is essential for interpreting the methodology and results of this study. 2.2 Review of existing systems: This literature review examines the evolution of Network Intrusion Detection Systems (NIDS) from 2011 to 2019, highlighting a shift from foundational data preprocessing and taxonomic strategies toward high-performance frameworks like Hadoop and Cloud-based environments. 
The research emphasizes the integration of machine learning, specifically Random Forest methods, and the critical role of data collection, evidenced by the categorization of 34 public datasets to improve detection accuracy and scalability. As shown in Figure 2.1: - Table 2 Examines the evolution of Network Intrusion Detection Systems Author & Year Title Focus Topics Shoham et al. (2023) Hybrid ddos detection Hybrid DDoS Detection: A hybrid machine learning approach specifically designed. Detect and mitigate Distributed Denial-of- Service (DDoS) attacks. Davis and Clark. 2011 Data preprocessing for anomaly-based network intrusion detection: A review Data preprocessing Relevant features construction using targeted content parsing and deeper network packet inspection Jeong et al. 2012 Anomaly tele-traffic intrusion detection systems on Hadoop- based Platforms: A Framework Hadoop and big data platforms for speed, storage volume, and cost-efficiency
  • 14.
    14 survey of problemsand solutions Poston. 2012 A brief taxonomy of intrusion detection strategies Strategies Taxonomy of traditional network intrusion detection Modi et al. 2013 A survey of intrusion detection techniques in Cloud Framework Incorporating IDS on host system and virtual machines Keegan et al. 2016 A survey of cloud- based network intrusion detection analysis Framework Integrating machine learning algorithms and MapReduce to cloud computing environments (Figure 2.1 review examines the evolution of Network Intrusion Detection Systems) A comparative analysis of the various techniques adopts a multi-fold approach, where the techniques are categorized based on their distinct characteristics, and then compared to identify the advantages and disadvantages of each technique. The categories chosen include (1) based on learning mechanism employed for classification and detection, (2) based on features used for training and detection, (3) AI techniques employed, and (4) based on the deployment. As shown in the Figure 2.2: - Table 3 Comparative Analysis of Techniques (Figure 2.2 COMPARATIVE ANALYSIS OF THE VARIOUS TECHNIQUES ADOPTS A MULTI-FOLD APPROACH)
  • 15.
    15 2.3 Comparison ofexisting solutions: Before presenting the comparison, it’s essential to note that the analysed tools employ different approaches to cyberattack prediction. BforeAI PreCrime relies on predictive modelling and domain behaviour scoring to forecast malicious domains before they’re used. SafeBreach simulates attack scenarios using breach and attack emulation techniques to assess defensive gaps. ThetaRay applies deep learning and statistical anomaly detection over large-scale data streams to uncover hidden threats. Darktrace uses self-learning AI and clustering to identify abnormal behaviours in real time. Comparing these technologies highlights how current solutions integrate AI in various forms predictive analytics, simulation, deep anomaly detection, and self-learning to achieve proactive cyber defence. Shown in Figure 2.3: - Table 4: Comparison of existing solutions (Figure 2.3 Shows a comparison of existing solutions) 2.3 Evaluation of current states: The current state of network-based cyberattack systems has evolved considerably with the integration of Artificial Intelligence (AI) and Machine Learning (ML). 
Traditional rule-based Intrusion Detection Systems (IDS) are still Tool Name Timeframe Datasets Used Technique Used Accuracy Disadvantages BforeAI PreCrime Up to 89 days ahead Domain metadata, behavioural patterns Predictive modeming, behavioural scoring 50% for domain- level prediction Can’t detect zero-day attacks SafeBreach Hours to days ahead Simulated attack methods Attack simulation, scenario modelling Not publicly disclosed Does not predict external threats & relies on simulated playbooks ThetaRay Real-time (Hours ahead) Big data logs Deep anomaly detection (statistical + DL) Not publicly disclosed Not ideal for IT-centric threats; specialized to financial data flows Darktrace Antigena Minutes to hours ahead Proprietary traffic & self- learned organizational behaviour Self- learning AI, clustering, anomaly detection Not publicly disclosed Can generate false positives; lacks transparency into detection rationale
  • 16.
    16 widely deployed; however,they struggle to predict sophisticated and previously unknown attack patterns due to their reliance on static signatures and predefined rules. This limitation has catalysed a shift toward more adaptive and intelligent systems. A significant issue in the current state of research is the over-reliance on synthetic or outdated datasets, which may not reflect real world traffic. This can result in high accuracy during testing but poor generalization to live networks. Additionally, class imbalance (more benign traffic than attacks) benign is the false positives or non-malicious data; and lack of diversity in attack types can bias the models. Plus, many modern AI models, especially deep learning-based ones, act as "black boxes." While their predictions may be accurate, understanding why a certain traffic flow was classified as malicious is often unclear. This poses a problem in critical environments where interpretability and trust in automated decisions are necessary, which in result the use of the AI explanation libraries could play a big role in the consequences of the project. Another current state which drew our attention is that in AI based cybersecurity systems is their vulnerability to adversarial machine learning. Attackers can exploit weaknesses in the model by crafting malicious traffic designed to evade detection, known as evasion attacks, like the recent fake antiviruses that got trended a while ago. These may involve subtle manipulations in network flow characteristics such as altering packet sizes, interarrival times, or flow counts to bypass the model without triggering alerts. Such attacks are particularly dangerous because they often go unnoticed while still accomplishing malicious goal Additional major threat comes in the form of poisoning attacks, where attackers inject misleading or crafted data into the training pipeline. 
This compromises the model’s learning process, causing it to make incorrect forecast during deployment. In high-stakes environments like banking or critical infrastructure, even a small decrease in model performance due to data poisoning can lead to significant security breaches. Poisoned models may even "learn" to ignore specific attack patterns entirely, leaving systems defenceless against known threats. Moreover, model inversion attacks present a privacy risk by allowing adversaries to infer sensitive training data or internal logic of the model based on its outputs. In network contexts, this could reveal user behaviour, device types, or even traffic patterns related to secure internal services. These attacks highlight that AI not only protects but also exposes new attack surfaces that must be secured. To defend against these vulnerabilities, AI-based systems must incorporate robust training methods, such as adversarial training, encryption, anomaly injection resistance, and continuous validation using fresh traffic. Additionally, deploying these models in secure environments with strong access controls and logging can reduce the risk of exploitation.
Chapter 3 – Methodology & Plan
3.1 Introduction
This chapter outlines the systematic approach used to design and implement the Intelligent Model for Predicting Network Cyber-Attacks. The methodology serves as the backbone of the project, ensuring that data is handled correctly, models are trained rigorously, and the final system provides reliable, real-time security insights. By combining data science principles with network security protocols, this methodology aims to achieve high detection accuracy with minimal false alarms.
3.2 System Development Methodology
The project follows the Waterfall Methodology, a linear and sequential software development life cycle. Given the critical security requirements of an Intrusion Detection System (IDS), this model was chosen to ensure that each stage, from data acquisition to system deployment, is completed and validated before the next phase commences.
The Waterfall approach provides a structured environment where requirements are frozen after the initial phase. This stability was essential for the complex integration of network monitoring tools (Snort/Zeek) with the Python-based AI backend, ensuring that the data flow remained consistent and reliable.
Security systems require clear documentation and stable requirements. By using Waterfall, we ensured that the dataset selection was settled and validated before we began training the Random Forest model, which minimized errors in the integration phase.
3.3 Requirement Gathering Techniques (Waterfall)
To establish a robust system specification, multiple information-gathering strategies were utilized:
Document Analysis: An exhaustive review of the UNSW-NB15 dataset technical whitepapers was conducted to understand feature engineering requirements. Additionally, the official documentation for Snort and Zeek was analysed to map log formats to AI input variables.
Brainstorming & Expert Consultation: Technical sessions were held to define the "Intelligence Suite" logic, specifically focusing on how to bridge the gap between signature-based alerts and probabilistic AI behaviour.
Prototyping: Small-scale tests were performed to determine the hardware and software limitations of running a Flask web server alongside real-time network sniffers.
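To illustrate what mapping a log format to AI input variables can look like, the following is a minimal sketch that reads Zeek conn.log lines in the default tab-separated layout and renames a few columns to UNSW-NB15-style feature names. The mapping table, the shortened column list, and the sample records are assumptions made for illustration; they are not taken from the project code.

```python
# Assumed mapping from Zeek conn.log columns to UNSW-NB15-style feature names.
ZEEK_TO_FEATURE = {
    "duration": "dur",
    "proto": "proto",
    "service": "service",
    "orig_pkts": "spkts",
    "resp_pkts": "dpkts",
    "orig_bytes": "sbytes",
    "resp_bytes": "dbytes",
}
NUMERIC = {"dur", "spkts", "dpkts", "sbytes", "dbytes"}


def parse_conn_log(lines):
    """Turn Zeek TSV conn.log lines into a list of feature dictionaries."""
    fields = []
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # column names follow the tag
            continue
        if not line or line.startswith("#"):
            continue  # skip the remaining header and footer metadata
        record = dict(zip(fields, line.split("\t")))
        features = {}
        for zeek_name, feature_name in ZEEK_TO_FEATURE.items():
            value = record.get(zeek_name, "-")
            if value == "-":  # Zeek writes "-" for unset fields
                features[feature_name] = None
            elif feature_name in NUMERIC:
                features[feature_name] = float(value)
            else:
                features[feature_name] = value
        rows.append(features)
    return rows


# Shortened, hypothetical column list (a real conn.log has more columns).
COLUMNS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
           "proto", "service", "duration", "orig_bytes", "resp_bytes",
           "conn_state", "orig_pkts", "resp_pkts"]
sample = [
    "#separator \\x09",
    "#fields\t" + "\t".join(COLUMNS),
    "\t".join(["1700000000.0", "Cabc1", "10.0.0.5", "51514", "10.0.0.2", "22",
               "tcp", "ssh", "1.5", "1200", "3400", "SF", "12", "10"]),
    "\t".join(["1700000001.0", "Cabc2", "10.0.0.5", "51515", "10.0.0.2", "22",
               "tcp", "-", "-", "-", "-", "S0", "1", "0"]),
]
flows = parse_conn_log(sample)
print(flows[0])
```

Unset fields are kept as None rather than guessed, so that a later imputation step can decide how to fill them.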
(Figure 3.1 Waterfall methodology)
3.4 Project Plan (Gantt Chart)
The project followed a strict timeline covering all components, from data pre-processing through to the web interface, as shown in Figure 3.2:
(Figure 3.2 Gantt Chart tasks timeline)
3.5 Development Tools and Technologies
The project utilizes a modern Python-based stack designed for high-performance machine learning and data processing.
Data Pre-processing:
• Programming Language (Python): Chosen for its extensive ecosystem of security and AI libraries.
• Scikit-Learn (Library): Used to build the "Fail-Safe" Pipeline, including the ColumnTransformer and VotingClassifier.
• Joblib (Persistence): Used to produce the .pkl (pickle) file, allowing the model to be saved and loaded into a real-time environment without retraining.
• Pandas & NumPy: Essential for handling the large-scale data structures of network traffic logs.
• Random Forest & PCA: Random Forest serves as the core intelligence layer, using an ensemble of decision trees to classify network traffic through a majority voting mechanism. PCA was implemented to reduce the high dimensionality of the UNSW-NB15 dataset by projecting it into a lower-dimensional space while retaining 95% of the variance.
• Matplotlib & Seaborn: Used to generate the "Intelligence Performance Summary" and "Model Percentages" charts.
• Google Colab: The primary development environment for writing and testing the dataset pre-processing.
Implementation:
• Operating System (Linux Ubuntu): The main environment used for integrating the model, providing a stable platform for running concurrent security services.
• Network Analyzer (Zeek): Used for understanding traffic behaviour and generating high-level metadata logs. Zeek transforms raw packets into structured data that the AI can interpret.
• Intrusion Detection System (Snort): Implemented for signature-based detection. Snort applies predefined rules to the traffic stream to catch known malicious patterns instantly.
• Integrated Development Environment (PyCharm): The central IDE used to write the integration code, managing the complex logic required to merge Zeek/Snort outputs into a single Python application.
• Web Framework (Flask): Used to develop the final dashboard. Flask acts as the "Presentation Layer," pulling data from the AI model to show real-time "Normal vs. Attack" percentages on the user's SIEM dashboard.
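The 95% variance criterion mentioned for PCA in the list above amounts to simple arithmetic on the eigenvalues of the covariance matrix. The sketch below shows that selection rule in plain Python; the eigenvalues are hypothetical and the helper name is ours. In scikit-learn the same rule is applied when a fraction such as 0.95 is passed as n_components to PCA.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return the smallest k whose top-k eigenvalues explain >= target of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues (variance captured by each principal component).
eigenvalues = [5.0, 3.0, 1.2, 0.4, 0.3, 0.1]
k = components_for_variance(eigenvalues, target=0.95)
print(k)  # number of components kept
```

With these made-up values the first three components explain 92% of the variance and the first four explain 96%, so four components are kept.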
Chapter 4 – System Specification
4.1 Introduction
This chapter describes the UNSW-NB15 network intrusion dataset used in the project. The following sections explain the methodology used to extract features from the dataset and the step-by-step workflow for data preparation. They detail the process of training the system using Random Forest and Support Vector Machine (SVM) algorithms based on behavioural analysis. After the dataset is processed and verified, the findings from the AI's feature importance analysis are used to write manual Snort rules. This hybrid approach allows the system to identify both known signatures and abnormal behaviour. Figure 4.1 illustrates the general stages of the methodology, from data pre-processing to final performance evaluation.
(Figure 4.1 System chart)
4.2 Functional Requirements
Dataset Selection: The dataset used in this project is UNSW-NB15, a modern network intrusion dataset. Unlike older datasets (like KDD99), UNSW-NB15 contains a comprehensive variety of contemporary synthesized attack activities and normal traffic. The dataset consists of over 250,000 samples categorized into nine attack types (including Fuzzers, Exploits, DoS, and Generic) plus Normal traffic. It was obtained in CSV format, containing 45 features including duration, protocol type, and packet counts.
Data Preprocessing: In this stage, the raw network logs are cleaned and transformed. Redundant data is removed, and categorical strings (like "TCP" or "UDP") are converted into numerical values that the AI can understand.
I. Cleaning: To ensure the model does not crash on incomplete records, an imputation strategy was used. Any missing values (NaNs) or undefined records were handled using SimpleImputer. Numerical gaps were filled with the median value, and categorical gaps were filled with an "unknown" placeholder.
II. Feature Encoding & Normalization: Categorical features were processed using One-Hot Encoding, and numerical features were scaled using StandardScaler so that features with large values (like sbytes) did not overshadow features with small values (like dur).
III. Balancing: The split was stratified to ensure that both "Normal" and "Attack" classes were proportionally represented in both the training and testing sets.
Feature Selection (RF + PCA): To achieve maximum efficiency in a live SIEM (Security Information and Event Management) environment, a two-step reduction process was used:
1. Random Forest (RF): First, RF was used to determine the "Feature Importance." This identifies which network attributes (like sttl or sbytes) have the most significant impact on detecting an attack.
2. Principal Component Analysis (PCA): Following RF, PCA was applied to reduce dimensionality.
We retained 95% of the cumulative variance, which compressed the wide dataset into a core set of 22 Principal Components. This reduces processing time without sacrificing the accuracy of the detection engine.
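The cleaning, encoding and scaling steps described in this section can be sketched in plain Python as follows. This is only an illustration of what median imputation, an "unknown" placeholder, one-hot encoding and standard scaling do to the data; the project itself used the scikit-learn classes named above, and the helper names and sample values here are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_numeric(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, placeholder="unknown"):
    """Replace missing entries (None) with a fixed placeholder category."""
    return [placeholder if v is None else v for v in values]


def standardize(values):
    """Scale to zero mean and unit variance, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # a constant column would otherwise divide by zero
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical mini-batch with one gap in each column.
sbytes = [200.0, None, 1000.0, 400.0]
proto = ["tcp", "udp", None, "tcp"]

sbytes_filled = impute_numeric(sbytes)
sbytes_scaled = standardize(sbytes_filled)
proto_filled = impute_categorical(proto)
categories, proto_encoded = one_hot(proto_filled)

print(sbytes_filled)
print(categories, proto_encoded)
```

After scaling, a byte count in the thousands and a duration in fractions of a second end up on comparable ranges, which is the reason given above for normalizing before training.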
Figure 4.2 shows the most important indicators for detecting the attacks:
(Figure 4.2: Top 10 Network features attack indicators)
The Random Forest model works by asking a series of "Yes/No" questions (Decision Trees).
• A feature is "Important" if it is the best at splitting the data into two clean groups (Normal vs. Attack).
• If a feature like sttl (Source TTL) can instantly separate 80% of attacks from normal traffic, the AI gives it a high "Importance Score."
Mathematically, this is often measured by the Gini Impurity index:
$G = 1 - \sum_{k} p_k^2$, where $p_k$ is the proportion of samples of class $k$ in a node.
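As a worked illustration of how impurity drives the importance score, the sketch below computes the Gini impurity of a node and the impurity decrease achieved by splitting on a threshold. The sttl values, labels and thresholds are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Drop in weighted Gini impurity when splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical flows: source TTL and the class of each flow.
sttl = [31, 31, 62, 254, 254, 254, 254, 62]
label = ["normal", "normal", "normal", "attack",
         "attack", "attack", "attack", "normal"]

print(gini(label))                          # impurity before any split
print(impurity_decrease(sttl, label, 100))  # a split that separates the classes
print(impurity_decrease(sttl, label, 40))   # a weaker split
```

In this toy data the split at 100 removes all impurity, while the split at 40 removes only part of it. A random forest accumulates such decreases per feature over all trees, which is what the importance chart summarizes.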
4.3 Logical system
Random Forest (RF): Random Forest is an ensemble learning method that builds multiple decision trees and merges their outputs to get a more accurate and stable prediction. In our model, RF acts as the primary "Heavy Hitter" classifier. It is highly resistant to overfitting and excels at identifying complex patterns in high-dimensional network data.
Support Vector Machine (SVM): SVM is a supervised learning model that finds the optimal hyperplane to separate classes in a multi-dimensional space. It is particularly effective for binary classification of network traffic where the boundary between "Normal" and "Attack" is narrow.
Figure 4.3 presents the Receiver Operating Characteristic (ROC) curve of the proposed AI-based intrusion detection system. The ROC curve illustrates the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR) at various classification thresholds. The solid curve represents the performance of the Random Forest classifier, while the diagonal dashed line indicates the performance of a random classifier. The model achieves an Area Under the Curve (AUC) of 0.990, which demonstrates excellent discriminative capability between normal and attack traffic.
(Figure 4.3 ROC curve)
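To make the ROC description concrete, the following sketch computes the (FPR, TPR) points by sweeping the decision threshold over the model scores, and then the area under the curve with the trapezoidal rule. The scores and labels are hypothetical and unrelated to the reported AUC of 0.990.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical attack probabilities and true labels (1 = attack, 0 = normal).
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

The diagonal of a random classifier corresponds to the two points (0, 0) and (1, 1), whose area is 0.5. The closer the curve bends toward the top-left corner, the closer the area gets to 1.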
Random Forest Results: After applying PCA, the Random Forest model demonstrated outstanding performance. It achieved an overall Accuracy of 94.82%. The model showed high precision, meaning it generated very few false alarms, making it suitable for a real-time SOC environment.
SVM Results: The SVM model achieved an Accuracy of 93.32%. While slightly lower than RF, the SVM provided a robust secondary validation, particularly in identifying "Stealth" attacks that involve low packet counts, as shown in figure (4.3):
(Figure 4.3 Percentages)
Figure 4.4 illustrates the integrated workflow of the hybrid detection system, where live network traffic is captured and simultaneously processed through two primary channels. On the left, Snort applies rules to identify known malicious patterns, while on the right, the Machine Learning pipeline (Random Forest/SVM) performs behavioural analysis on the UNSW-NB15 feature set. The central integration module consolidates these findings to provide a comprehensive security verdict, which is then visualized in real time via the administrative dashboard.
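One possible way for a central integration module to consolidate the two channels is sketched below. The fusion rule, the 0.5 threshold, the verdict strings and the function name are all assumptions made for illustration; the text above does not specify how the project combines the Snort result with the model outputs.

```python
def hybrid_verdict(snort_alert, rf_attack_prob, svm_attack_prob, threshold=0.5):
    """Combine a signature hit with the averaged model probabilities (soft voting)."""
    ml_score = (rf_attack_prob + svm_attack_prob) / 2
    ml_flag = ml_score >= threshold
    if snort_alert and ml_flag:
        return "attack (signature + behaviour)"
    if snort_alert:
        return "attack (known signature)"
    if ml_flag:
        return "suspicious (behaviour only)"
    return "normal"


# Hypothetical events: (Snort fired?, RF probability, SVM probability).
events = [
    (True, 0.90, 0.80),
    (True, 0.10, 0.20),
    (False, 0.70, 0.60),
    (False, 0.10, 0.10),
]
for snort_alert, rf_prob, svm_prob in events:
    print(hybrid_verdict(snort_alert, rf_prob, svm_prob))
```

Keeping the signature-only and behaviour-only cases as separate verdicts lets a dashboard show which channel raised the event, which is useful when tuning rules or thresholds.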
Chapter 5 – Implementation & Testing
5.1 Introduction
This chapter details the transition of the project from its theoretical and architectural design into a functional, multi-layered security prototype for Improving Snort Techniques Using Machine Learning. The implementation phase focuses on the integration of established network monitoring tools, specifically Snort and Zeek, with a custom Python-based AI backend. By utilizing the UNSW-NB15 dataset and the machine learning models specified in the previous chapter, Random Forest and SVM, this phase demonstrates how raw network traffic is transformed into actionable security intelligence.
Figure 5.1 shows the libraries imported at the start of the pre-processing phase:
(Figure 5.1 Data Pre-processing)
Figure 5.2 shows the evaluation results obtained after pre-processing (Accuracy, Recall, Precision, F1-score):
(Figure 5.2 Results)
Figure 5.3: The start of the integration between the AI model (intrution_model.pkl), Snort alerts (snort/alert) and the Zeek network monitoring file (conn.log)
(Figure 5.3 Model integration)
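The four metrics reported in Figure 5.2 all derive from the counts of true and false positives and negatives. A minimal sketch of that calculation follows; the labels and predictions are hypothetical and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions (1 = attack, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision falls when normal traffic is flagged as an attack (false alarms), while recall falls when attacks are missed, so the two are best read together.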
Figure 5.4: Using the Flask framework, we wrote an app.py file that serves the events dashboard on a specific port
(Figure 5.4 App Flask)
Figure 5.5: The Snort rules we installed, based on the attacks best captured by our model
(Figure 5.5 Snort Rules)
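Before Snort events can be listed on a dashboard, each line of the alert file has to be split into fields. The sketch below assumes Snort's single-line "fast" alert output; the regular expression, the function name, and the sample alert (including its message and rule ID) are hypothetical and are not taken from the project's app.py.

```python
import re

# Pattern for one line of Snort "fast" alert output (assumed format).
ALERT_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+\[\*\*\]\s+"
    r"\[(?P<gid>\d+):(?P<sid>\d+):(?P<rev>\d+)\]\s+"
    r"(?P<msg>.*?)\s+\[\*\*\]\s+"
    r"(?:\[Classification:\s*(?P<classification>[^\]]*)\]\s+)?"
    r"\[Priority:\s*(?P<priority>\d+)\]\s+"
    r"\{(?P<proto>\w+)\}\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+)$"
)


def parse_fast_alert(line):
    """Return the alert fields as a dictionary, or None if the line does not match."""
    match = ALERT_PATTERN.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    alert["sid"] = int(alert["sid"])
    alert["priority"] = int(alert["priority"])
    return alert


# Hypothetical alert line.
sample_line = (
    "01/15-10:22:33.123456  [**] [1:1000001:1] Possible SSH connection flood "
    "[**] [Classification: Attempted Denial of Service] [Priority: 2] "
    "{TCP} 10.0.0.5:51514 -> 10.0.0.2:22"
)
parsed = parse_fast_alert(sample_line)
print(parsed)
```

Returning None for lines that do not match keeps a partially written or malformed line from crashing the page that displays the events.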
(Figure 5.6 Sign-in User interface)
Figure 5.7: The final GUI events dashboard, which also shows some test events
(Figure 5.7 Home Dashboard)
Figure 5.8: After an incident, the person in charge can download a Report.pdf, which presents the last 10 attacks with their metadata
(Figure 5.8 Report file last events)
5.2 Testing phase
Figure 5.2.1 shows the launch of a PS script against the secure shell port to test whether the AI model detects a high volume of connections as a potential DoS/Probing attack
(Figure 5.2.1)
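The behaviour this test exercises, many connections from one source in a short time, can be expressed as a sliding-window counter. The sketch below is a simple rule-based stand-in for that idea rather than the trained model itself; the class name, the limit of 20 connections and the 10-second window are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class ConnectionRateMonitor:
    """Flag a source that opens too many connections within a time window."""

    def __init__(self, max_connections=20, window_seconds=10.0):
        self.max_connections = max_connections
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # source IP -> recent timestamps

    def observe(self, src_ip, timestamp):
        """Record one connection; return True if the source exceeds the limit."""
        window = self._recent[src_ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_connections


monitor = ConnectionRateMonitor()

# Hypothetical burst: 30 connections to port 22 within three seconds.
burst_flags = [monitor.observe("10.0.0.5", i * 0.1) for i in range(30)]
# Hypothetical normal host: one connection every five seconds.
normal_flags = [monitor.observe("10.0.0.9", i * 5.0) for i in range(5)]

print(burst_flags.index(True))  # position of the first flagged connection
print(any(normal_flags))
```

A per-source count of this kind could also be supplied to the classifier as an additional behavioural feature instead of being used as a hard rule.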
Figure 5.2.2 shows how the event log is presented to the user after the incident
(Figure 5.2.2)
Chapter 6 – Future work & Conclusion
6.1 Introduction
This chapter presents an analysis of the results obtained from the development of this project. It discusses the data collected during the implementation phase, evaluates the effectiveness of the system through testing, and analyses the test results to provide context for the findings. The overall goal is to gain insights into the strengths and weaknesses of the developed system and to understand the extent to which it meets its intended objectives.
6.2 Future Work
While this study successfully demonstrated the integration of the AI models with Snort and Zeek for threat detection, several avenues for future research remain:
• Real-time Deployment and Scalability: Future research could focus on transitioning the current model from a controlled environment (Google Colab) to a live, high-speed network environment. Investigating the computational overhead of deep learning models in real-time "inline" deployments would be critical for industry adoption.
• Expansion of Datasets: While the UNSW-NB15 dataset provides a robust foundation, future iterations of this work should incorporate more recent datasets (such as CSE-CIC-IDS2018) to account for evolving exploits and modern encrypted traffic patterns.
• Adversarial Machine Learning: A significant area for further study is the resilience of AI models against adversarial attacks, where attackers specifically craft traffic to bypass machine learning-based detection.
6.3 Conclusion
This project enhanced traditional intrusion detection by integrating Machine Learning with Snort to address the limitations of signature-based systems. Through the application of Random Forest and SVM models on the UNSW-NB15 dataset, the system achieved accuracies of 94.82% and 93.32%, respectively, in identifying the specified threats.
The implementation indicates that behavioural analysis can reduce false positives while maintaining high detection accuracy across diverse network traffic. These findings suggest that this hybrid approach provides a more robust and scalable defence mechanism than conventional methods alone. While the current prototype shows high efficiency, future work could explore deep learning architectures to further automate threat mitigation. Ultimately, this research bridges the gap between static rule sets and dynamic AI-driven security, offering a more resilient framework for modern cybersecurity environments.
References:
Academic Journals & Papers
• Moamin, S. A., Abdulhameed, M. K., Al-Amri, R. M., Radhi, A. D., Naser, R. K., & Pheng, L. G. (2025). Artificial Intelligence in Malware and Network Intrusion Detection: A Comprehensive Survey of Techniques, Datasets, Challenges, and Future Directions. Babylonian Journal of Artificial Intelligence, 2025, 77-98.
• Davis, J. J., & Clark, A. J. (2011). Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security, 30(6), 353–375.
• Jeong, H. J., Hyun, W., Lim, J., & You, I. (2012). Anomaly teletraffic intrusion detection systems on Hadoop-based platforms: A survey of some problems and solutions. In 15th International Conference on Network-Based Information Systems (pp. 766–770).
• Poston, H. E. (2012). A brief taxonomy of intrusion detection strategies. In IEEE National Aerospace and Electronics Conference (pp. 255–263).
• Chou, D., & Jiang, M. (2021). A survey on data-driven network intrusion detection. ACM Computing Surveys (CSUR), 54(9), 1-36.
• Moustafa, N., & Slay, J. (2015, November). UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS) (pp. 1-6). IEEE.
• Shohan, N. J., Tanbhir, G., Elahi, F., Ullah, A., & Sakib, M. N. (2023, December). Enhancing network security: A hybrid approach for detection and mitigation of distributed denial-of-service attacks using machine learning. In International Conference on Advanced Network Technologies and Intelligent Computing (pp. 81-95). Cham: Springer Nature Switzerland.
Company Technology & Software
• Bfore.Ai. (n.d.). Predictive domain scoring and pre-emptive threat intelligence: Technology overview. https://bfore.ai/
• Darktrace. (n.d.). Autonomous response and real-time behavioural clustering: Cyber AI research. https://www.darktrace.com/
• Google. (n.d.). Google Colab. https://colab.google/
• PythonAnywhere. (n.d.). PythonAnywhere host services. https://www.pythonanywhere.com/
• SafeBreach. (n.d.). Attack emulation and defensive gap assessment: Platform features. https://www.safebreach.com/
• ThetaRay. (n.d.). Deep learning and statistical analysis for large-scale data. https://thetaray.com/
Methodology
• Lucid Software. (n.d.). Waterfall methodology overview. https://www.lucidchart.com/