This thesis aims to evaluate the effectiveness of commonly used object detection methods like YOLO v8, YOLO v5, and Faster R-CNN in identifying vehicles in images captured by drones under varying degrees of snow cover. It also explores data augmentation techniques and edge devices' real-time tracking capabilities when applied to aerial images in snowy conditions. The goal is to contribute insights into challenges of vehicle detection in snow and suggest improvements to accuracy and reliability of detection systems in adverse weather.
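Evaluating detectors such as YOLO or Faster R-CNN against ground-truth boxes typically comes down to intersection-over-union (IoU) between predicted and labeled boxes. As an illustrative sketch (not code from the thesis), a minimal IoU computation looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap at all.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection is usually counted as correct when IoU with a ground-truth box exceeds a threshold such as 0.5.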
Detecting fraud with Python and machine learning (wgyn)
- Machine learning models are used to detect fraud by estimating the probability of fraud given transaction features.
- Building and updating fraud detection models involves significant work in feature engineering, model training, evaluation, and monitoring in production.
- Debugging a model that was performing poorly revealed an important predictive feature - whether a customer's email address was provided - that improved the model once incorporated.
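As a toy illustration of the idea in these bullets (the weights and feature set below are invented for the example, not taken from any real fraud model), a logistic scorer that includes a "was an email address provided" flag might look like:

```python
import math

def fraud_probability(amount, hour, email_provided, weights=None):
    """Toy logistic model: P(fraud | features). Weights are illustrative,
    not from any trained production model."""
    if weights is None:
        # bias, amount (per $1000), late-night flag, missing-email flag
        weights = {"bias": -3.0, "amount_k": 0.8, "night": 0.7, "no_email": 1.5}
    z = (weights["bias"]
         + weights["amount_k"] * (amount / 1000.0)
         + weights["night"] * (1 if hour < 6 else 0)
         + weights["no_email"] * (0 if email_provided else 1))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to (0, 1)
```

A missing email address raises the score, mirroring the debugging anecdote above.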
Computer vision has received great attention over the last two decades.
This research field is important not only in security-related software but also in the advanced interface between people and computers, advanced control methods, and many other areas.
Face recognition technology uses machine learning algorithms to identify or verify a person's identity from digital images or video frames. The process involves detecting faces, applying preprocessing techniques like filtering and scaling, training classifiers using labeled face images, and then classifying new faces. Common machine learning algorithms used include K-nearest neighbors, naive Bayes, decision trees, and locally weighted learning. The proposed system detects faces, builds a tabular dataset from pixel values, trains classifiers, and evaluates performance on a test set. Software applies techniques like detection, alignment, normalization, and matching to encode faces for comparison. Face recognition has advantages like convenience and low cost, and applications in security, banking, and more.
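A minimal sketch of the "tabular dataset from pixel values plus K-nearest neighbors" step described above, using made-up toy rows rather than real face images:

```python
def knn_predict(train_rows, train_labels, query, k=3):
    """Classify a flattened image (a list of pixel values) by majority
    vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

In a real system each row would be a preprocessed (filtered, scaled) face image flattened into one vector.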
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
The Gabor filter is a powerful way to enhance biometric images such as fingerprint images in order to extract correct features from them. It is also used to extract features directly, as in iris images, and is sometimes used for texture analysis. In fingerprint images, the even-symmetric Gabor filter, a contextual (multi-resolution) filter, is used to enhance the image in two ways: it fills small gaps in the direction of the ridge (a low-pass effect in the black regions), and it increases the discrimination between ridges and valleys (black and white regions) in the direction orthogonal to the ridge. The proposed method applies the Gabor filter to a fingerprint image that has first been translated into a binary image after some simple enhancement steps, in order to partially overcome the high computational cost of the Gabor filter.
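The even-symmetric Gabor kernel the passage refers to can be written out directly from its standard definition (a cosine carrier under a Gaussian envelope); the parameter values below are illustrative, not the ones used in the proposed method:

```python
import math

def even_gabor_kernel(size, theta, freq, sigma):
    """Even-symmetric Gabor kernel oriented along angle theta,
    as commonly used to enhance fingerprint ridges."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the ridge frame.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * freq * xr))
        kernel.append(row)
    return kernel
```

Convolving the image with this kernel at the local ridge orientation smooths along ridges and sharpens across them.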
The document describes a project that aims to develop a mobile application for real-time object and pose detection. The application will take in a real-time image as input and output bounding boxes identifying the objects in the image along with their class. The methodology involves preprocessing the image, then using the YOLO framework for object classification and localization. The goals are to achieve high accuracy detection that can be used for applications like vehicle counting and human activity recognition.
The document describes two feature extraction methods: attention based and statistics based. The attention based method models how human vision finds salient regions using an architecture that decomposes images into channels and creates image pyramids, then combines the information to generate saliency maps. This method was applied to face recognition but had problems with pose and expression changes. The statistics based method aims to select a subset of important features using criteria based on how well the features represent the original data.
Yinyin Liu presents a model for object detection and localization called Fast R-CNN. She shows how to introduce an ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin runs through a demo of applying the trained model to detect new objects.
Real Time Object Detection using machine learning (pratik pratyay)
This document discusses the development of a real-time object detection system using computer vision techniques. It aims to recognize and label moving objects in video streams from monitoring cameras with high accuracy and in a short amount of time. The system will use a hybrid model of convolutional neural networks and support vector machines for feature extraction and classification of objects from camera feeds into predefined classes. It is intended to help analyze surveillance video by only flagging clips that contain objects of interest like people or vehicles, reducing wasted storage and review time.
Online fraud costs the global economy more than $400 billion, with more than 800 million personal records stolen in 2013 alone. Increasingly, fraud has diversified to different digital channels, including mobile and online payments, creating new challenges as innovative fraud patterns emerge. Hence it is still a challenge to find effective methods to mitigate fraud. Existing solutions include simple if-then rules and classical machine learning algorithms.
From an academic perspective, credit card fraud detection is a standard classification problem in which historical transaction data is used to predict future fraud. However, practical aspects make the problem more complex. Existing comparison measures lack a realistic representation of monetary gains and losses, which is necessary for effective fraud detection. Moreover, of the enormous volume of transactions only a tiny fraction are fraudulent, which implies a huge class imbalance. Additionally, a real fraud detection system is required to respond in milliseconds, a criterion that must be taken into account during modeling for the system to be successfully deployed. To address these problems, this presentation compares two recently proposed algorithms, Bayes minimum risk and an example-dependent cost-sensitive decision tree, against state-of-the-art algorithms; they show significant improvements as measured by financial savings.
The document discusses object tracking in computer vision. It begins with an introduction and overview of applications of object tracking. It then discusses object representation, detection, tracking algorithms and methodologies. It compares different tracking methods and provides an example of object tracking in MATLAB. Key steps in object tracking include object detection, tracking the detected objects across frames using algorithms like point tracking, kernel tracking and silhouette tracking. Common challenges with object tracking are also summarized.
This document presents a fast algorithm for license plate detection. It begins with an introduction that outlines the need for automatic license plate recognition systems. It then discusses previous work in the area and the challenges involved. The proposed technique is divided into four main parts: histogram equalization, removal of border and background, image segmentation, and license plate detection using feature extraction, principal component analysis, and artificial neural networks. Test results on a dataset of 30 images achieved a 93.33% detection rate. Future work involves implementing the neural network classifier on an FPGA for increased speed.
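The first stage of that pipeline, histogram equalization, can be sketched on a flat list of gray levels (a simplified stand-in for a 2-D plate image):

```python
def equalize_histogram(pixels, levels=256):
    """Remap gray levels so the cumulative distribution is roughly uniform,
    spreading out contrast before segmentation."""
    n = len(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Build the cumulative distribution function.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
            for p in pixels]
```

In the paper's setting this would run on the full image before border removal and segmentation.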
Deep learning and neural networks are inspired by biological neurons. Artificial neural networks (ANN) can have multiple layers and learn through backpropagation. Deep neural networks with multiple hidden layers did not work well until recent developments in unsupervised pre-training of layers. Experiments on MNIST digit recognition and NORB object recognition datasets showed deep belief networks and deep Boltzmann machines outperform other models. Deep learning is now widely used for applications like computer vision, natural language processing, and information retrieval.
We use 7 emotions, namely 'Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad', and 'Surprise', to train and test our algorithm using Convolutional Neural Networks.
This document presents a proposed methodology for offline signature recognition using global and grid features extracted from signature images. The methodology involves preprocessing signatures, extracting global and grid features using discrete wavelet transforms, training a backpropagation neural network on the features, and classifying signatures based on the trained network. Experimental results show classification accuracy rates ranging from 89-93% for signatures from 10 to 50 individuals. Future work could involve exploring different signature features to potentially improve recognition performance.
Outlier analysis is used to identify outliers, which are data objects that are inconsistent with the general behavior or model of the data. There are two main types of outlier detection - statistical distribution-based detection, which identifies outliers based on how far they are from the average statistical distribution, and distance-based detection, which finds outliers based on how far they are from other data objects. Outlier analysis is useful for tasks like fraud detection, where outliers may indicate fraudulent activity that is different from normal patterns in the data.
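The distance-based definition can be sketched in a few lines; here the points are one-dimensional and the eps and fraction thresholds are arbitrary example values:

```python
def distance_based_outliers(points, eps, min_frac):
    """Flag points that have fewer than min_frac of the other points
    within distance eps (the classic distance-based outlier definition)."""
    outliers = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and abs(q - p) <= eps)
        if neighbors / (len(points) - 1) < min_frac:
            outliers.append(p)
    return outliers
```

In a fraud setting, a transaction far from all typical transactions under this criterion would be flagged for review.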
Automatic Attendance System using Facial Recognition (Nikyaa7)
It is a biometric-based app that is gradually evolving into a universal biometric solution, with virtually zero effort required from the user when compared with other biometric options.
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selection based on Mutual Information (Abeer Alzubaidi, Georgina Cosma, David Brown and Graham Pockley)
Interactive Technologies and Games (ITAG) Conference 2016
Health, Disability and Education. Dates: Wednesday 26 October 2016 - Thursday 27 October 2016. Location: The Council House, NG1 2DT.
A comprehensive tutorial on Convolutional Neural Networks (CNN) which talks about the motivation behind CNNs and Deep Learning in general, followed by a description of the various components involved in a typical CNN layer. It explains the theory involved with the different variants used in practice and also, gives a big picture of the whole network by putting everything together.
Next, there's a discussion of the various state-of-the-art frameworks being used to implement CNNs to tackle real-world classification and regression problems.
Finally, the implementation of CNNs is demonstrated by implementing the paper 'Age and Gender Classification Using Convolutional Neural Networks' by Levi and Hassner (2015).
- Image classification involves training a classifier on labeled images, validating hyperparameters, and testing on unlabeled images.
- Nearest neighbor classification predicts labels of nearest training examples while linear classification learns weights to separate classes with a hyperplane.
- Loss functions like cross-entropy measure how well the classifier's predicted scores match the true labels and are minimized during training.
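The cross-entropy loss mentioned in the last bullet can be computed from raw class scores as follows (a minimal sketch, using the usual max-shift for numerical stability):

```python
import math

def softmax_cross_entropy(scores, true_index):
    """Cross-entropy loss of raw class scores against the true label."""
    m = max(scores)                              # shift for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax probabilities
    return -math.log(probs[true_index])
```

Training minimizes this value, pushing the score of the correct class above the others.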
Credit Card Fraud Detection Using Unsupervised Machine Learning Algorithms (Hariteja Bodepudi)
This document summarizes a research paper that uses unsupervised machine learning algorithms to detect credit card fraud. It describes how credit card fraud has increased with the rise of online shopping and payments. Unsupervised algorithms are well-suited for this task since labeled fraud data can be difficult to obtain. The paper tests Isolation Forest, Local Outlier Factor, and One Class SVM on a credit card transaction dataset to find anomalies (fraudulent transactions). Isolation Forest achieved the highest accuracy at 99.74%, slightly outperforming Local Outlier Factor, while One Class SVM had much lower accuracy. The paper concludes unsupervised algorithms are effective for anomaly detection tasks like credit card fraud detection.
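The core intuition behind Isolation Forest, that anomalies are separated by random splits in far fewer steps than normal points, can be sketched on one-dimensional data (a simplified illustration, not the implementation the paper evaluated):

```python
import random

def isolation_path_length(point, data, rng, max_depth=10):
    """Number of random splits needed to isolate `point` from `data`."""
    depth = 0
    while depth < max_depth and len(data) > 1:
        lo, hi = min(data), max(data)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Keep only the values on the same side of the split as `point`.
        data = [v for v in data if (v < split) == (point < split)]
        depth += 1
    return depth

def anomaly_score(point, data, n_trees=200, seed=0):
    """Average path length over many random trees; shorter = more anomalous."""
    rng = random.Random(seed)
    paths = [isolation_path_length(point, data, rng) for _ in range(n_trees)]
    return sum(paths) / len(paths)
```

A fraudulent transaction, lying far from the dense cluster of normal ones, tends to be isolated after very few splits.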
Introduction to image processing and pattern recognition (Saibee Alam)
This PowerPoint presentation provides a brief introduction to image processing and pattern recognition and its related research papers, including a conclusion.
The document discusses object recognition in computer vision. It begins with an overview of object recognition, describing it as the task of finding and identifying objects in images. It then discusses several specific applications of object recognition, including fingerprint recognition and license plate recognition. Fingerprint recognition involves extracting features called minutiae from fingerprint images, which are ridge endings and bifurcations. License plate recognition uses an ALPR system to segment character images, normalize them, and recognize the characters.
The document discusses perception in artificial intelligence. It defines perception as acquiring, interpreting, and organizing sensory information. Perception involves both sensation, where sensors convert signals into data, and higher-level processes that make sense of the data. The document then discusses challenges in perception like abstraction and uncertainty in relations. It also notes perception is influenced by both internal and external factors beyond just signals.
Person Detection in Maritime Search And Rescue Operations (IRJET Journal)
This document discusses recent research on using computer vision and machine learning techniques for person detection in maritime search and rescue operations from images and video captured by drones. Specifically, it summarizes 12 research papers on this topic, covering approaches such as training convolutional neural networks on bird's eye view datasets to detect people from aerial images, using multiple detection methods like sliding windows and precise localization, combining data from multiple drones and sensors to optimize search efforts, and evaluating models on both RGB and thermal image datasets. The goal of this research is to automate part of the search process to make maritime rescue operations more efficient and effective.
Person Detection in Maritime Search And Rescue Operations (IRJET Journal)
1) The document discusses using machine learning and computer vision techniques for person detection in maritime search and rescue operations using drones/UAVs. It aims to automatically detect people in images/videos captured by drones to help with search efforts.
2) A key challenge is that people appear small in drone footage and are often obscured by vegetation or terrain. The models need to be trained on similar bird's eye view data to achieve high accuracy. The document reviews different person detection models and their use in search and rescue.
3) It discusses recent work involving using efficient neural networks like MobileNet for object detection from drones. Other work involves using depth sensors and pose estimation for person tracking, as well as using distributed deep learning
Object Detection for Autonomous Cars using AI/ML (IRJET Journal)
The document discusses using machine learning and computer vision techniques for object detection in autonomous vehicles. Specifically, it proposes using the Single Shot Detector (SSD) algorithm to identify and classify objects around a self-driving car from camera images. The SSD model was trained on a dataset to detect common objects like cars, people, buses etc. and estimate bounding boxes around detected objects. The methodology uses OpenCV and TensorFlow to implement SSD on images from a webcam in real-time. While bounding boxes were sometimes inconsistent in dense traffic, detection was more accurate for objects closer to the camera or in less crowded scenarios. The goal is to demonstrate how computer vision allows autonomous vehicles to perceive their surroundings.
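Detectors like SSD emit many overlapping candidate boxes, which is one reason bounding boxes can look inconsistent in dense traffic; a non-maximum suppression pass is the standard post-processing step. A minimal sketch (not the paper's exact implementation):

```python
def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes and drop lower-scoring boxes that
    overlap an already-kept box by more than iou_thresh."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices of surviving boxes
```

Tightening or loosening iou_thresh trades duplicate detections against missed nearby objects.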
This document introduces the Named Data Networking (NDN) project, which proposes a new internet architecture called NDN. The key points are:
1. NDN is based on named data rather than endpoints, allowing data to be directly addressed and routed instead of requiring encapsulation in endpoint-to-endpoint communication.
2. The NDN architecture uses hierarchical names for data, in-network caching, interest-based forwarding using a pending interest table, and data-centric security with signatures.
3. The document outlines several research areas needed to develop and evaluate the NDN architecture, including routing scalability, fast name lookup, caching policies, security mechanisms, and applications to drive adoption.
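The pending interest table mentioned in point 2 can be illustrated with a toy data structure (the names and face identifiers here are invented for the example):

```python
class PendingInterestTable:
    """Toy NDN forwarding state: map a data name to the set of faces that
    requested it, so one returning Data packet satisfies all requesters."""
    def __init__(self):
        self.table = {}

    def add_interest(self, name, face):
        already_pending = name in self.table
        self.table.setdefault(name, set()).add(face)
        # If the name was already pending, the Interest was aggregated
        # and need not be forwarded upstream again.
        return already_pending

    def satisfy(self, name):
        """Data arrived: return the faces to send it on, clearing the entry."""
        return self.table.pop(name, set())
```

This aggregation of duplicate Interests is what lets one cached or returning Data packet serve many consumers.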
The document summarizes two projects completed during an internship at JCDecaux:
1. Psychographic Spatial Segmentation (PSA) aimed to cluster geographic regions based on residents' preferences from an external data provider, but the project was interrupted when data access stopped.
2. Ads Recognition sought to classify ad image content to restrict inappropriate ads, comparing free models to Amazon Rekognition. Various datasets and models were evaluated to map image tags to business needs.
As Zambia's population grows and residential and industrial developments expand, the demand for efficient and effective transit services is increasing in developing areas including (Lusaka, Kitwe, Livingstone etc.)
SMARTbus system will provide customers with a variety of features including automated voice stop announcements, automated exterior route and destination announcements, automated passenger counter, and GPS location services. Meaning, SMARTbus riders can track their bus in real-time. No more standing at a bus stop wondering if a bus already came or is stuck somewhere in traffic. Smart bus technology delivers up-to-the-minute bus departure information for every bus stop in the city/route.
Q. Why are new smart bus transportation systems such as SMARTbus needed?
These systems are expected to enable bus operators to improve vehicle safety, and schedule reliability for the fixed-route services.
Additionally, providing enhanced service and service information to riders as they travel.
Automatically announce the next scheduled stop and display this information on a sign inside the bus in real time.
Automatically announce the bus destination and time estimate to waiting passengers at each scheduled bus stop.
Advanced High Speed Black Box Based Vehicle Crash Investigation Systemijtsrd
This document describes a proposed vehicle black box system that records data during crashes to aid in accident investigations. It would include sensors to monitor speed, location via GPS, and video footage via front and back cameras. The data would be stored temporarily in memory and could then be retrieved for analysis after a crash. The system aims to objectively determine the causes and sequence of events leading up to and during accidents. This could help insurance claims and reduce pending cases by providing clear evidence.
The document presents a thesis that developed and evaluated a localization component for mobile service applications. The thesis implemented a platform called YourWay! that collects contextual data from distributed sources and facilitates instant location information to mobile users. Empirical evaluation of YourWay! assessed user experience in indoor and outdoor environments. Results showed user experience was more reliable within community WiFi infrastructure, especially indoors, depending on access point coverage, density, and structure.
This document discusses improving the scalability, availability, and reliability (SAR) of virtualization for operating systems, networks, and storage. It examines Oracle Solaris virtualization technology (Solaris Containers) which allows multiple isolated operating systems to run simultaneously on the same hardware. The author aims to build fault tolerance into the Solaris operating system and its virtual environments (Solaris Containers) through designs that optimize network connectivity using IPMP and aggregation, storage using Solaris Volume Manager (SVM) and ZFS, and shared resource allocation. The thesis will evaluate how these designs increase SAR when outages occur at the operating system, network, or storage levels for both planned and unplanned maintenance scenarios.
This thesis presents a framework for integrated uncertainty modeling and visualization. It aims to address four major barriers: (1) users must anticipate their uncertainty needs before building models, (2) uncertainty parameters are treated the same as variables, (3) uncertainty propagation must be manually managed, and (4) visualization techniques are largely incompatible with different uncertainty types. The framework encapsulates uncertainty into atomic variables, automates uncertainty propagation, and abstracts visual mappings from the underlying uncertainty type. It extends the traditional spreadsheet to intrinsically support uncertainty modeling and visualization. Case studies demonstrate the framework for business planning, financial decision support, and process specifications.
Automatic Detection of Unexpected Accidents Monitoring Conditions in TunnelsIRJET Journal
The document describes a proposed system to automatically detect accidents and unexpected events in road tunnels using video footage from CCTV cameras. The system would use object detection and tracking technology, along with a Faster R-CNN deep learning model, to identify objects like vehicles, fires, and people in tunnel videos. It would monitor the movement and position of detected objects over time to identify accidents or other irregular events. If an accident is detected, a signal would be sent to alert authorities so they can respond quickly. The system aims to address the challenges of limited visibility and low-quality images from tunnel CCTV cameras.
Traffic Congestion Detection Using Deep Learningijtsrd
Despite the huge amount of traffic surveillance videos and images have been accumulated in the daily monitoring, deep learning approaches have been underutilized in the application of traffic intelligent management and control. Traffic images, including various illumination, weather conditions, and vast scenarios are considered and preprocessed to set up a proper training dataset. In order to detect traffic congestion, a network structure is proposed based on residual learning to be pre trained and fine tuned. The network is then transferred to the traffic application and retrained with self established training dataset to generate the Traffic Net. The accuracy of Traffic Net to classify congested and uncongested road states reaches 99 for the validation dataset and 95 for the testing dataset. The trained model is stored in cloud storage for easy access for application from anywhere. The proposed Traffic Net can be used by a regional detection of traffic congestion on a large scale surveillance system. The effectiveness and efficiencies are magnificently demonstrated with quick detection in the high accuracy in the case study. The experimental trial could extend its successful application to traffic surveillance system and has potential enhancement for intelligent transport system in future. Anusha C | Dr. J. Bhuvana "Traffic Congestion Detection Using Deep Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-2 , February 2022, URL: https://www.ijtsrd.com/papers/ijtsrd49401.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/49401/traffic-congestion-detection-using-deep-learning/anusha-c
Accident vehicle types classification: a comparative study between different...nooriasukmaningtyas
Classifying and finding type of individual vehicles within an accident image are considered difficult problems. This research concentrates on accurately classifying and recognizing vehicle accidents in question. The aim to provide a comparative analysis of vehicle accidents. A number of network topologies are tested to arrive at convincing results and a variety of matrices are used in the evaluation process to identify the best networks. The best two networks are used with faster recurrent convolution neural network (Faster RCNN) and you only look once (YOLO) to determine which network will identifiably detect the location and type of the vehicle. In addition, two datasets are used in this research. In consequence, experiment results show that MobileNetV2 and ResNet50 have accomplished higher accuracy compared to the rest of the models, with 89.11% and 88.45% for the GAI dataset as well as 88.72% and 89.69% for KAI dataset, respectively. The findings reveal that the ResNet50 base network for YOLO achieved higher accuracy than MobileNetV2 for YOLO, ResNet50 for Faster RCNN with 83%, 81%, and 79% for GAI dataset and 79%, 78% and 74% for KAI dataset.
Android Project report on City Tourist Location based services (Shuja ul hassan)Shuja Hassan
The aim to design and develop this project is to produce a
tourist guide for Skardu city, which can eefficiently guides the
tourist who visits Skardu. The Android tourist guide can be use in place of professional guide due to many reasons like reduce cost of guide, get more accurate information needed for decision making, giving weather and social networking services.The tourists can use this guide for different purposes like searching a location , calculate distance between two locations,getting basic textual information, pictorial information of location which normally we could not get in default Google maps.The guide uses Google Map API, global
positioning system( GPS), Internet and cellular data to provide
its services.
Shuja ul Hassan
IT Teacher
Android Developer
shuja2good@gmail.com
The document describes the development of a traffic sign recognition system using machine learning techniques. It involves building a convolutional neural network (CNN) model to classify images of traffic signs into different categories. The front-end will utilize libraries like Pandas, NumPy, Matplotlib and OpenCV for data processing and visualization. Tkinter will be used for the graphical user interface. The back-end will use TensorFlow and Keras deep learning frameworks to develop the CNN model for traffic sign classification. The system aims to accurately detect and recognize traffic signs to help with autonomous driving.
Innovative Payloads for Small Unmanned Aerial System-Based PersonAustin Jensen
This document describes Austin Jensen's doctoral dissertation titled "Innovative Payloads for Small Unmanned Aerial System-Based Personal Remote Sensing and Applications". The dissertation focuses on using small unmanned aerial systems (UAS) as remote sensing platforms to collect aerial imagery and track radio-tagged fish. It presents novel methods for calibrating imagery from commercial cameras on UAS and estimating the location of radio-tagged fish using multiple UAS. The calibration methods allow imagery to be orthorectified and used to deliver actionable information to end users. Simulations evaluate the fish tracking methods and predict their real-world performance. The dissertation contributes to advancing UAS-based remote sensing for applications such as precision agriculture, vegetation mapping,
This document provides guidelines for integrating sensor data into a Grid-enabled Spatial Data Infrastructure (GSDI). It describes current sensor technologies, including wireless sensor networks and the Sensor Web. It analyzes different solutions for integrating sensor measurements into a GSDI to support the EnviroGRIDS project, which aims to build capacity for environmental observation in the Black Sea region. The document recommends adopting the Sensor Observation Service to enable interoperable access to sensor data within the GSDI and describes a proposed system using wireless sensor networks, mobile communication units, and various client applications to integrate real-time sensor data.
This document provides an introduction to spatial data analysis using open source software R. It discusses spatial data concepts like spatial reference systems and coordinate reference systems. It describes how to create, load and visualize spatial point, line and polygon data in R. It also covers digital image processing and classification in QGIS. Methods discussed include spatial point pattern analysis, interpolation, geostatistics, spatial modeling and accuracy assessment. The document uses data from Kilimanjaro region as an example to demonstrate these spatial analysis techniques in R and QGIS.
An Audit of Augmented Reality Technology UtilizationIRJET Journal
This document summarizes research on the applications and uses of augmented reality technology. It discusses how augmented reality has been used in various fields such as engineering, tourism, defense, education, public safety, gaming, and medicine. Some key applications mentioned include using AR for construction projects, enhancing the tourist experience, military training, improving education, and assisting medical procedures. The document also reviews challenges with AR such as registration and detection errors and discusses potential future research directions.
Similar to Real Time Vehicle Detection for Intelligent Transportation Systems (20)
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Real Time Vehicle Detection for Intelligent Transportation Systems
DEGREE PROJECT
Real Time Vehicle Detection for Intelligent
Transportation Systems
Christián Ulehla
Elda Shurdhaj
Master Programme in Data Science
2023
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
ABSTRACT
This thesis aims to analyze how object detectors perform under winter weather conditions,
specifically in areas with varying degrees of snow cover. The investigation will evaluate the
effectiveness of commonly used object detection methods in identifying vehicles in snowy
environments, including YOLO v8, YOLO v5, and Faster R-CNN. Additionally, the study explores
the process of labeling vehicle objects within a set of image frames to produce high-quality
annotations in terms of correctness, detail, and consistency. Training data is the cornerstone upon
which the development of machine learning is built. Inaccurate or inconsistent annotations can
mislead the model, causing it to learn incorrect patterns and features. Data augmentation
techniques like rotation, scaling, or color alteration have been applied to enhance some models'
robustness to recognize objects under different alterations. The study aims to contribute to the field
of deep learning by providing valuable insights into the challenges of detecting vehicles in snowy
conditions and offering suggestions for improving the accuracy and reliability of object detection
systems. Furthermore, the investigation will examine edge devices' real-time tracking and
detection capabilities when applied to aerial images under these weather conditions. What drives
this research is the need to delve deeper into the research gap concerning vehicle detection using
drones, especially in adverse weather conditions. It highlights the scarcity of substantial datasets
before Mokayed et al. published the Nordic Vehicle Dataset. Using unmanned aerial vehicles
(UAVs) or drones to capture real images in different settings and under various snow cover
conditions in the Nordic region contributes to expanding the existing dataset, which has previously
been restricted to non-snowy weather conditions. In recent years, the use of drones to capture
real-time data for optimizing intelligent transport systems has surged. Drones offer an aerial
perspective for efficiently collecting data over large areas to monitor vehicular movement
precisely and in a timely manner, an area that is imperative to address. Snowy weather
conditions, moreover, can create an environment of limited visibility, significantly complicating
data interpretation and object detection. The emphasis is placed on edge devices' real-time
tracking and detection capabilities, which in this study means integrating edge computing into
drone technologies to explore the speed and efficiency of data processing in such systems.
Key Concepts: Unmanned aerial vehicles (UAVs), vehicle detection, CNN architecture, snow
cover, NVD, YOLOv5s, YOLOv8s, Faster R-CNN, real-time processing, edge devices.
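The augmentation techniques named in the abstract (rotation, scaling, color alteration) can be sketched as a simple transform over an image array. The function below is an illustrative stand-in, not the pipeline used in this thesis: rotation is simplified to 90-degree steps, scaling to a random crop, and all parameter ranges are assumptions.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Simplified augmentation for an HxWx3 uint8 aerial frame.

    Illustrative only: rotation is restricted to 90-degree steps,
    scaling is approximated by a random crop, and the ranges below
    are assumptions, not the thesis's actual settings.
    """
    # rotation: random multiple of 90 degrees
    img = np.rot90(img, k=int(rng.integers(0, 4)))
    # scaling: random crop covering 80-100% of each side
    h, w = img.shape[:2]
    s = rng.uniform(0.8, 1.0)
    ch, cw = int(h * s), int(w * s)
    y0 = int(rng.integers(0, h - ch + 1))
    x0 = int(rng.integers(0, w - cw + 1))
    img = img[y0:y0 + ch, x0:x0 + cw]
    # color alteration: random brightness gain
    gain = rng.uniform(0.7, 1.3)
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```

In a training loop such a transform would typically be applied per frame, with the corresponding bounding-box annotations adjusted to match any geometric change.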
Dedication and Acknowledgments
First and foremost, we convey our profound appreciation to our advisor, Hamam Mokayed, whose
expertise, understanding, and patience, added considerably to our graduate experience. His
mentorship, encouragement, and candid criticism have been invaluable.
We would also like to express our heartfelt thanks to the NVD team, especially Amirhossein
Nayebiastaneh, for their unwavering support throughout the thesis. Their provision of resources
and expertise significantly enriched our research.
In the spirit of true collaboration, we extend our deepest gratitude to each other as co-authors of
this thesis. This journey has been a testament to our shared dedication, mutual respect, and
collective pursuit of excellence. Sharing the challenges and hard work made this journey all the
more rewarding.
To our incredible families and loved ones,
This thesis would not have been possible without your unwavering support, occasional pep talks,
and endless supply of coffee (and cookies!). You've been the cheerleaders in the corner and the
providers of a continuous flow of care packages that truly kept us going.
Thank you all!
Elda and Christián
CONTENTS
CHAPTER 1 ................................................................................................................................ 1
1. Introduction ..................................................................................................................... 1
1.1 Background................................................................................................................... 1
1.2 Motivation ..................................................................................................................... 3
1.3 Problem Definition......................................................................................................... 5
1.4 Goals and Delimitations ................................................................................................ 6
1.5 Research Methodology ................................................................................................. 7
1.6 Structure of the Thesis .................................................................................................. 9
CHAPTER 2 ...............................................................................................................................12
2. Contextual Framework ...................................................................................................12
2.1 Computer Vision and Vehicle Detection .......................................................................12
2.2 Deep Learning Techniques for Vehicle Detection.........................................................16
2.3 Aerial Image Detection Challenges ..............................................................................26
2.4 Real-Time Tracking and Detection on Edge Devices ...................................................28
2.5 Evaluation Metrics and Benchmarking .........................................................................29
2.6 Relevant Studies, Research Gaps and Future Directions.............................................36
CHAPTER 3 ...............................................................................................................................39
3. Thesis Methodology Approach .......................................................................................39
3.1 Dataset Acquisition and Description.............................................................................39
3.2 Data Splitting................................................................................................................41
3.3 Frame and Annotation Processing ...............................................................................42
3.4 Training Techniques, Models .......................................................................................45
3.5 Edge Device Application ..............................................................................................48
3.6 Evaluation Metrics and Benchmarking .........................................................................50
3.7 Ethical Considerations .................................................................................................54
CHAPTER 4 ...............................................................................................................................56
4. Analysis, Experiments and Results.................................................................................56
4.1 Data Analysis and Experiments....................................................................................56
4.2 Results.........................................................................................................................63
CHAPTER 5 ...............................................................................................................................64
5. Discussion, Final Remarks, and Prospects for Future Research ....................................64
5.1 Discussion ...................................................................................................................64
5.2 Conclusions .................................................................................................................66
5.3 Future Work .................................................................................................................67
References ..............................................................................................................................69
List of Abbreviations
CNN Convolutional Neural Networks
CV Computer Vision
CVAT Computer Vision Annotation Tool
DOTA Dataset for Object Detection in Aerial Images
NVD Nordic Vehicle Dataset
OBB Oriented Bounding Boxes
R-CNN Region-based Convolutional Neural Networks
ReLU Rectified Linear Unit
UAV Unmanned Aerial Vehicles
YOLO You Only Look Once
List of Tables and Figures
Table 1. Specification of Freya unmanned aircraft Mokayed et al. ............................................40
Table 2. Specification of Freya unmanned aircraft Mokayed et al. ............................................41
Table 3. Summary of NVD dataset processing..........................................................................42
Table 4. Specifications of PC on which model was trained on...................................................48
Table 5. Results of selected detection models. .........................................................................63
Figure 1. Machine Learning Family ...........................................................................................16
Figure 2. YOLO evolution..........................................................................................................20
Figure 3. Faster R-CNN architecture.........................................................................................21
Figure 4. Illustration of precision-recall curve ............................................................................31
Figure 5. Intersection over Union ..............................................................................................33
Figure 6. mAP flowchart calculation ..........................................................................................35
Figure 7. Steps for calculating mAP ......................................................................................35
Figure 8. Confusion Matrix ........................................................................................................36
Figure 9. Showcase of dataset images .....................................................................................40
Figure 10. Difference between inner bounding boxes with rotation and outer bounding boxes
without rotation.........................................................................................................43
Figure 11. Illustration of Gaussian elliptical function over heatmap ............................44
Figure 12. From left to right: original image, produced heatmaps, overlay of both.....................45
Figure 13. Example of Yolo model sizes ...................................................................................45
Figure 14. CNN network architecture made as pytorch nn.Module............................................46
Figure 15. Our CNN network architecture .................................................................................47
Figure 16. Image of chosen edge-device: Raspberry Pi............................................................50
Figure 17. Difference between predicted heatmap (left) and heatmap after threshold mapping
(right)........................................................................................................................51
Figure 18. Process of squaring the results of heatmap after threshold mapping........................51
Figure 19. From top-left to bottom-right: Squared heatmap after applying threshold, squared
heatmap before applying threshold, Replacement of sizes, Final removal of small
bounding boxes........................................................................................................52
Figure 20. From left to right: Overlay of ground-truth bounding boxes with image and overlay of
predicted bounding boxes with image.......................................................................53
Figure 21. Illustration of general effect of varying learning rates................................................57
Figure 22. Showcase when model stopped at sub-optimal solution of outputting all-0 heatmap.
Training/Validation loss stopped decreasing and output heatmap is black ................................58
Figure 23. Effect of different learning rates on training of our network. Graphs describe evolving
training and validation loss over epochs....................................................................................59
Figure 24. Training/validation loss of our final model during training .........................61
Figure 25. Difference in size between actual bounding boxes (left) and predicted bounding
boxes (right) .............................................................................................................61
Figure 26. Confusion matrix of whole testing dataset for 0.9 quantile thresholding....................62
Figure 27. Difference between ground-truth (left) and predictions (right) on image from sunny
video, showcasing problem of our model with generalization ...................................65
CHAPTER 1
Thesis Introduction
1. Introduction
1.1 Background
In recent years, the field of computer vision has witnessed remarkable progress, driven mainly by
the rapid advancements in deep learning techniques (Mokayed & Mohamed, 2014), in applications
such as document classification analysis (Kanchi et al., 2022) and medical studies (Voon et al., 2022).
Among the numerous applications of computer vision, vehicle detection plays a crucial role in
diverse domains, such as intelligent transportation systems (Dilek & Dener, 2023), autonomous
driving, traffic management, surveillance systems (Mokayed et al., 2023), environmental
monitoring (Sasikala et al., 2023; Zhou et al., n.d.), document analysis (Khan et al., 2022;
Mokayed et al., 2014), and many other fields. Accurate and efficient detection of vehicles in real-
world scenarios poses significant challenges due to varying weather conditions, complex
backgrounds, occlusions, and diverse vehicle appearances (Mokayed et al., 2022; Farid et al.,
2023). To address these challenges, researchers have turned to the power of deep learning,
specifically Convolutional Neural Networks (CNNs), as a promising solution.
The ubiquity of vehicles in our daily lives underscores the significance of efficient vehicle
detection systems in ensuring safety, optimizing traffic flow, and enabling intelligent
transportation systems. Traditional approaches to vehicle detection often relied on handcrafted
features and rule-based algorithms, which limited their adaptability to varying environmental
conditions. Moreover, as the complexities of real-world scenarios increased, the demand for more
accurate and sophisticated detection techniques became evident (Xu et al., 2021).
Real-time processing in computer vision means analyzing visual data like images and video frames
instantly without any significant delay. This fast analysis helps make quick decisions based on the
camera data. Real-time processing ensures that vehicles can be detected and tracked accurately
and efficiently, providing valuable information for tasks such as traffic management, autonomous
driving, and surveillance systems (Dilek & Dener, 2023). By processing images or video streams
in real time, computer vision algorithms can quickly identify and classify vehicles, enabling timely
actions to be taken.
Deep learning, a subfield of machine learning, has emerged as a transformative force in computer
vision due to its ability to learn intricate patterns and representations directly from data.
Convolutional Neural Networks (CNNs), particularly, have showcased exceptional performance
in various image recognition tasks, including object detection (Arkin et al., 2023). Leveraging
CNNs for vehicle detection presents a promising solution to overcome conventional methods'
challenges, thus driving advancements in this domain (Arkin et al., 2023).
The quality of images plays a crucial role in how well deep learning Convolutional Neural
Networks (CNNs) perform in image detection tasks (Dong et al., 2016). CNNs rely on high-quality
input images to accurately detect and recognize objects. Ensuring that CNNs are provided with
well-processed, high-quality images is essential for reliable and effective object detection in
practical applications. To ensure the effectiveness and reliability of object detection systems,
studying and improving the high accuracy of detection models is of utmost importance. The
accuracy of models directly affects their ability to identify and classify objects correctly within
images or videos. Object detection models can provide accurate and precise information about the
presence, location, and type of objects by achieving high accuracy.
Detecting and recognizing vehicles in drone images enhances safety in various applications.
Drones have significantly impacted image detection in deep learning by offering a unique aerial
perspective, enabling efficient data collection over large areas. Equipped with high-resolution
cameras, drones provide detailed imagery, facilitating precise object detection and recognition in
aerial images, even in real-time.
However, detecting vehicles in drone-captured images poses challenges, especially when the
images are taken at oblique angles. These challenges include non-uniform illumination,
degradation, blurring, occlusion, and reduced visibility (Mokayed et al., 2023; Mokayed et al.,
2021). Additionally, adverse weather conditions, such as heavy snow, further complicate vehicle
detection in snowy environments due to limited visibility and data interpretation difficulties.
Despite the progress made in object detection in natural images, the application to aerial images
remains distinct and poses unique challenges (Ding et al., 2022). Additionally, there is a lack of
sufficient data and research specifically focused on detecting vehicles in snowy conditions using
images captured by unmanned aerial vehicles (UAVs), highlighting an important research gap in
this field (Mokayed et al., 2023). Addressing these challenges and further investigating vehicle
detection in snowy conditions using drones is crucial to advance the capabilities of deep learning
models and improve safety in various domains.
1.2 Motivation
The lack of sufficient data and research on vehicle detection using drones in snowy conditions
underscores the need for this study. By training on a unique dataset of vehicles captured by UAVs in
diverse snowy weather conditions in the Nordic region, the thesis aims to advance the field. The
goal is to shed light on vehicle detection challenges in snowy conditions using drones and offer
valuable insights to improve the accuracy and reliability of object detection systems in adverse
weather scenarios. Moreover, the research aims to explore edge devices' real-time tracking and
detection capabilities when applied to aerial images under snowy conditions.
Mokayed et al. tackle the complex problem of vehicle detection and recognition in drone images,
explicitly focusing on challenging weather conditions like heavy snow. Their study highlights the
significance of using appropriate training datasets for developing effective detectors, especially in
adverse weather conditions. They provide the scientific community with a unique dataset captured
by unmanned aerial vehicles (UAVs) in the Nordic region, encompassing diverse snowy weather
conditions such as overcast with snowfall, low light, patchy snow cover, high brightness, sunlight,
and fresh snow, along with temperatures below 0 degrees Celsius.
Building upon the findings of Mokayed et al., this thesis aims to further investigate the
performance of object detectors under different winter weather conditions with varying degrees of
snow cover. Additionally, it seeks to evaluate the effectiveness of widely used object detection
methods, such as YOLO v8, YOLO v5, and Faster RCNN, in detecting vehicles in snowy
environments. YOLO and Faster RCNN are some of the most widely used models for object
detection tasks, including aerial vehicle detection. YOLO’s newer versions, specifically v8 and
v5, stand out due to their speed, enhanced detection accuracy, and robustness. On the other hand,
Faster R-CNN, though not as fast as YOLO, still offers relatively quick detection times compared
to the earlier RCNN models, which is vital for real-time applications. Furthermore, the
investigation explores data augmentation techniques to enhance detection performance in
challenging snowy scenarios.
Image detection from aerial images on edge devices is pivotal to this research. Edge devices refer
to computing devices or systems located close to the data source, such as onboard processors on
the drone or specialized hardware integrated into the camera system. By performing image
processing locally on the edge device, the thesis aims to exploit the benefits of reduced latency
and bandwidth optimization, which are crucial in real-time or near-real-time applications. In real-
world applications, due to adverse weather in northern regions, there might be a necessity for "on-
board" detection for real-time results because there might not be a stable connection. This approach
aligns with the intention to improve the efficiency and effectiveness of object detection in aerial
images.
In summary, the motivation behind this thesis lies in addressing the research gap in vehicle
detection under snowy conditions and advancing the field of deep learning-based image detection
in adverse weather scenarios. By leveraging the work of Mokayed et al. and presenting a novel
dataset, this study aspires to contribute to developing more robust and efficient detection
techniques with potential applications in safety, surveillance, and other critical domains.
Moreover, the investigation into edge devices' detection capabilities adds a valuable dimension to
the research, aiming to enhance real-time capabilities and optimize resource utilization in aerial
image processing.
1.3 Problem Definition
As mentioned in the motivation section of this thesis, it aims to address the challenges of vehicle
detection and recognition in drone images under various snowy weather conditions in the Nordic
region on edge devices. By analyzing the performance of existing methods, understanding the
impact of adverse weather, evaluating detection under diverse snowy scenarios, and exploring data
augmentation techniques, the thesis intends to advance the field of deep learning-based image
detection and contribute to safer and more efficient transportation systems and surveillance
applications. Additionally, investigating the application of edge devices for real-time tracking and
detection will further enhance the potential applications of the research findings in real-world
scenarios.
The following research questions will guide the investigation of this thesis:
1. How well do state-of-the-art (SOTA) vehicle detectors operate on hardware-constrained devices required
for real-time scenarios?
2. How does the performance of the introduced methodology in this work compare to existing
state-of-the-art techniques in terms of both accuracy and processing speed?
3. Which of the detection models can be effectively employed for real-time scenarios on the
edge device we have selected?
Research Objectives
Evaluation of Existing Techniques:
The first research question aims to evaluate the effectiveness of widely used vehicle detection
techniques, namely YOLO v8, YOLO v5, and Faster R-CNN, in snowy weather conditions on edge
devices. By comparing the performance of these methods, the thesis seeks to identify which
approach is better suited for detecting vehicles in snowy environments, contributing to the
advancement of vehicle detection technology.
Assessing the comparative performance:
The second research question seeks to quantify how well the proposed methodology identifies
vehicles in snowy weather conditions and where it falls short. This assessment is important for
determining whether the proposed approach represents a significant improvement over existing
techniques in terms of accuracy and processing speed.
Practical application:
The third research question focuses on the practical application of the detection models on chosen
edge devices in real-time scenarios. This question is crucial in determining the feasibility and
usability of the proposed methodology in real-world settings. This information is vital for
understanding the potential impact and usability of the research findings in practical settings,
ultimately contributing to the advancement of vehicle detection technology in snowy conditions.
1.4 Goals and Delimitations
Despite careful and rigorous research design and research questions crafted to explore various
aspects of aerial image detection on edge devices, this thesis is subject to certain limitations.
These can encompass the choice of object detection methods, the
variability of edge devices, and the practical deployment of the proposed techniques in real-world
scenarios.
While this thesis evaluates popular object detection techniques such as YOLO v8, YOLO v5, and
Faster R-CNN, it is essential to acknowledge that the field of computer vision offers a wide array
of algorithms and architectures. By focusing on specific methods, other promising
alternatives may not be thoroughly investigated, which may leave out valuable insights.
The investigation into the real-time tracking and detection capabilities of edge devices
assumes a certain level of uniformity among these devices. However, in practice, edge devices can
exhibit significant variations in terms of processing power, memory, and overall efficiency. These
differences might impact the generalizability of the findings to other edge device configurations.
Although this research concentrates on the technical aspects of vehicle detection using drones and
edge devices in snowy conditions, the practical deployment of these methods in real-world settings
comes with its own set of challenges and operational constraints. The real-world environment can
introduce complexities that extend beyond the scope of this thesis, warranting further consideration
for successful implementation.
Despite these limitations, the thesis contributes valuable insights into aerial image detection on
edge devices, offering potential advancements in object detection systems. By acknowledging
these potential constraints, the research provides a foundation for further investigation and future
developments in the field, paving the way for improved safety, surveillance, and transportation
systems.
1.5 Research Methodology
The research methodology employed in this thesis encompasses several key steps to achieve the
objectives of real-time vehicle detection in snowy conditions using aerial images on edge devices.
The methodology includes data alteration, dataset augmentation, model training and validation,
setting up the edge environment, testing detection models on edge devices, and model evaluation
or benchmarking.
Dataset Collection:
The data collection process is described by Mokayed et al., as the dataset over which the detection
models in this thesis are trained and tested is the NVD from their paper. This step describes how
the data were collected, the features of the data, and the technology used to collect them.
Dataset Alteration for Generating Heatmaps - Labels for Custom CNN:
Heatmaps are generated and used as labels for training a custom Convolutional Neural Network
(CNN) in this step. These heatmaps indicate the presence and location of vehicles within the
images, serving as ground-truth data for the CNN. By incorporating these heatmaps as labels, the
custom CNN is trained to detect and recognize vehicles in snowy conditions effectively.
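As an illustration of this labeling step, a heatmap can be rendered by placing a Gaussian peak at the centre of each annotated bounding box. The sketch below is a simplified, hypothetical version of such a procedure; the function name, the Gaussian form, and the sigma value are illustrative choices, not the thesis's actual implementation.

```python
import numpy as np

def boxes_to_heatmap(boxes, img_h, img_w, sigma=4.0):
    """Render a Gaussian peak at the centre of each bounding box.

    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a float array in [0, 1] of shape (img_h, img_w).
    """
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    heatmap = np.zeros((img_h, img_w), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # overlapping peaks keep the max
    return heatmap
```

The resulting map peaks at 1.0 over each vehicle centre and decays smoothly, which gives the network a dense regression target rather than sparse box coordinates.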
Dataset Augmentation for Training:
To address the challenge of data scarcity, dataset augmentation techniques are employed in certain
models to expand the dataset size and diversity. Augmentation methods may involve random
rotations, flipping, and adjustments in brightness and contrast. These augmentations are intended
to enhance the training dataset to improve the model's ability to generalize during the training of
vehicle detection.
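The augmentations listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline actually used in the experiments: the flip probability and the brightness/contrast ranges are arbitrary choices, and note that geometric transforms must also update the bounding-box labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, boxes):
    """Randomly flip an image horizontally and jitter brightness/contrast.

    image: uint8 array (H, W, 3); boxes: sequence of (x_min, y_min, x_max, y_max).
    Returns the augmented image and boxes adjusted to match.
    """
    h, w = image.shape[:2]
    boxes = np.asarray(boxes, dtype=np.float32).copy()
    if rng.random() < 0.5:                       # horizontal flip
        image = image[:, ::-1]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror x coordinates
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-20, 20)
    image = np.clip(image.astype(np.float32) * contrast + brightness, 0, 255)
    return image.astype(np.uint8), boxes
```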
Model Training and Validation:
The custom CNN, along with other chosen object detection methods like YOLO v8, YOLO v5,
and Faster RCNN, undergoes training on the augmented dataset. The training process involves
optimizing the model's parameters using suitable optimization algorithms and loss functions. To
ensure the models' generalization and prevent overfitting, validation is performed using a separate
validation dataset.
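The train-then-validate cycle described above can be illustrated with a toy model. The sketch below uses plain gradient descent on a logistic regressor with a binary cross-entropy loss, standing in for the far larger CNN training actually performed; the learning rate and epoch count are illustrative. The key point is that validation loss is tracked every epoch so overfitting can be detected.

```python
import numpy as np

def train_with_validation(Xtr, ytr, Xva, yva, lr=0.1, epochs=200):
    """Minimise binary cross-entropy with gradient descent, tracking
    validation loss each epoch so overfitting can be spotted."""
    rng = np.random.default_rng(1)
    w = rng.normal(0, 0.01, Xtr.shape[1])
    b = 0.0
    history = []
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(Xtr @ w + b)))       # forward pass
        grad_w = Xtr.T @ (p - ytr) / len(ytr)      # BCE gradient w.r.t. weights
        grad_b = np.mean(p - ytr)
        w -= lr * grad_w                           # parameter update
        b -= lr * grad_b
        pv = 1 / (1 + np.exp(-(Xva @ w + b)))      # evaluate on held-out data
        val_loss = -np.mean(yva * np.log(pv + 1e-9)
                            + (1 - yva) * np.log(1 - pv + 1e-9))
        history.append(val_loss)
    return w, b, history
```

A rising validation loss while training loss keeps falling is the classic sign of overfitting; with real detectors the same monitoring is done over mAP or detection loss.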
Setting Up the Edge Environment:
In this phase, the edge environment is established to evaluate the models' real-time tracking and
detection capabilities. Setting up the edge environment using a Raspberry Pi is a critical component
of the methodology as it enables evaluating the vehicle detection models in real-world conditions
with limited resources and processing capabilities. It helps ensure that the models are practical,
efficient, and capable of real-time tracking and detection. These are essential factors for successful
deployment in various applications such as drones, surveillance systems, and environmental
monitoring.
Testing the Detection Models on Edge Devices:
The trained detection models are deployed and tested on the edge devices within the established
environment. The models' performance is evaluated regarding their real-time processing
capabilities, accuracy in detecting vehicles under snowy conditions, and resource utilization.
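A simple way to quantify real-time processing capability on such a device is to time the detector over a sequence of frames. In this hypothetical sketch, `model_fn` stands in for whichever trained detector is deployed, and the warm-up count is an arbitrary choice to let caches and lazy initialisation settle before measuring.

```python
import time

def benchmark(model_fn, frames, warmup=3):
    """Average per-frame latency (seconds) and throughput (FPS) of a
    detector callable, discarding warm-up iterations."""
    for frame in frames[:warmup]:
        model_fn(frame)                    # untimed warm-up passes
    start = time.perf_counter()
    for frame in frames[warmup:]:
        model_fn(frame)
    elapsed = time.perf_counter() - start
    n = len(frames) - warmup
    return elapsed / n, n / elapsed
```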
Model Evaluation or Benchmarking:
Evaluation and benchmarking are conducted to assess the detection models' effectiveness and
robustness. The models are tested on a separate test dataset containing real-world drone images
with varying weather conditions. Performance metrics, such as precision, recall, average precision,
and time, are computed to gauge the models' detection accuracy and reliability.
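Precision and recall for detection are computed by matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold. The following is a minimal illustration of that matching (greedy, one-to-one, with an assumed 0.5 threshold), not the exact evaluation code used in the experiments.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground truths."""
    matched = set()
    tp = 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= iou_thr:
                matched.add(i)     # each ground truth may match once
                tp += 1
                break
    fp = len(preds) - tp           # unmatched predictions
    fn = len(gts) - tp             # missed ground truths
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```

Average precision then summarises the precision-recall trade-off as detections are ranked by confidence score.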
1.6 Structure of the Thesis
Chapter 1: Introduction
This chapter serves as an introductory section to the thesis, providing an in-depth overview of the
research study and establishing its contextual framework. The focal subject revolves around
contemporary advancements in vehicle detection techniques under diverse weather conditions,
specifically examining image data acquired through unmanned aerial vehicles.
The problem statements and the research questions/objectives outline the study's aim. This chapter
includes the motivation for why this research is essential and how it contributes to the existing
research on vehicle detection under heavy weather conditions. Finally, any limitations that may
affect the research, the methodology outline, and the organization of the thesis are also part of this
section.
Chapter 2: Literature Review
This chapter provides a comprehensive overview and analysis of the existing literature and
research related to the thesis topic. It aims to demonstrate the authors' knowledge of the field and
the gaps in current knowledge that the thesis intends to address. This section focuses on the
challenges specifically related to using Unmanned Aerial Vehicles (UAVs) for image capture and
analysis like image resolution, motion blur, occlusions, and other factors that affect vehicle
detection using UAV imagery. Here, the literature on vehicle detection under various weather
conditions is examined. Different studies and methods that address weather's impact on detection
algorithms' accuracy are discussed. In this part, various object detection algorithms are reviewed.
This includes deep learning-based approaches like YOLO (You Only Look Once) and Faster R-CNN
(Region-based Convolutional Neural Networks). This section explores different data augmentation
techniques to enhance the training dataset for vehicle detection models. Data augmentation helps
increase the data's diversity, improve model generalization, and handle imbalanced datasets.
Chapter 3: Methodology
This section outlines the specific environment in which the research is conducted. It also explains
the data collection process, including the sources of data and any alterations made to the dataset to
suit the research requirements. It also provides a detailed account of how Unmanned Aerial
Vehicles (UAVs) data was collected. It explains the procedures and equipment used to capture the
images or videos, the flight parameters, and any other relevant details regarding the data
acquisition process. Here, the characteristics and properties of the dataset used in the research are
outlined. It includes information about the size of the dataset, the number of samples, the
distribution of data across different categories or classes, and any specific attributes or labels
associated with the data.
This part delves into the various object detection methods used in the thesis. It provides detailed
explanations of each method, including YOLO v8, YOLO v5, and Faster R-CNN, and how they
function in the research context. The methodology chapter serves as a roadmap for the research,
outlining how the data is collected, how the object detection methods are chosen and implemented,
and how the models are trained and validated. Additionally, it provides details about the study
area, the data sources, and any alterations made to the dataset. The chapter concludes by explaining
how the detection models are tested on edge devices, evaluated, or benchmarked, and how the
edge environment is set up for testing and analysis.
Chapter 4: Analysis, Experiments, and Results
The "Analysis, Experiments, and Results" section in the thesis details the practical aspects of the
research study. It involves conducting experiments to evaluate the performance of different object
detection methods and presenting the findings obtained based on the setup environment on the
chosen edge device.
Various evaluation metrics are introduced in this section to assess the effectiveness of the object
detection methods. These metrics, such as precision, recall, accuracy, and mean average precision
(mAP), are commonly used to measure the methods' performance. This section comprehensively
examines and discusses the individual performances of the three object detection methods.
Each method's results are presented separately to
showcase how well they can detect vehicles under different weather conditions.
Moreover, a comparative analysis is included, where a direct comparison among the three object
detection methods is conducted. This analysis considers the performance metrics, computational
efficiency, and other relevant factors. The objective is to identify the most suitable method for the
specific task. Additionally, the impact of the data augmentation techniques applied in Chapter 3 is
evaluated in this chapter. This evaluation scrutinizes how the data augmentation process influenced
the object detection models' performance and discusses how much it enhanced the overall results.
Chapter 5: Discussion, Conclusions, and Future Work
This chapter addresses the challenges faced during the research process and any limitations that
may have influenced the outcomes. The discussion sheds light on the constraints that might have
impacted the research scope, data collection, or analysis. It also provides insights into potential
areas for improvement or further research.
This section also summarizes the key findings and research outcomes. The chapter
reiterates the main points discussed throughout the thesis and provides a concise statement of the
conclusions drawn from the research. It answers the research questions or objectives outlined at
the beginning of the thesis and reflects on whether they were successfully addressed. It discusses
how the research findings contribute to the existing knowledge in the field and highlights their
potential impact on theory, practice, or real-world applications. This section may also address any
unexpected or noteworthy discoveries made during the study. The chapter outlines areas that could
benefit from further investigation based on the limitations and gaps identified in the current study.
It may propose extensions to the current research, novel approaches, or new directions to explore
in related fields.
References
CHAPTER 2
Literature Review
2. Contextual Framework
2.1 Computer Vision and Vehicle Detection
The genesis of computer vision can be traced back to the 1960s. During this
period, the demand for interpreting and assessing visual data led to the advent of pioneering
techniques, enabling computers to identify and categorize patterns and objects within the images
(Wiley & Lucas, 2018; Mokayed et al., 2021). Central to computer vision are operations such as
image processing, object detection, and pattern recognition. The foundational principle of
computer vision lies in its capability to transform raw images into numerical data, subsequently
utilized as computational inputs. Converting images into numerical data involves several steps,
including image acquisition, preprocessing, feature extraction, and often deep learning-based
methods (Spencer et al., 2019).
This transformation process is enabled through the arrangement of images at the pixel level,
wherein each pixel may be characterized by a numerical value, either in grayscale or as a
combination of numerical values (e.g., 255, 0, 0 in the RGB color model) (Tan & Jiang, 2018).
Grayscale images represent each pixel's brightness level using a single numerical value. The value
typically ranges from 0 (black) to 255 (white), with varying shades of gray in between. Grayscale
is commonly used for simplicity and efficiency in image processing (Tan & Jiang, 2018). RGB
color images, on the other hand, use a combination of three numerical values for each pixel,
representing the intensity of red, green, and blue channels.
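This pixel-level representation can be made concrete in a few lines. The grayscale conversion below uses the common ITU-R BT.601 luminance weights as one example of collapsing the three RGB channels into a single value per pixel.

```python
import numpy as np

# A 2x2 RGB image: each pixel holds three channel intensities in 0-255.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Weighted sum of the channels yields one brightness value per pixel.
weights = np.array([0.299, 0.587, 0.114])       # BT.601 luminance weights
gray = (rgb.astype(np.float64) @ weights).round().astype(np.uint8)
```

Here the white pixel maps to 255 and the pure-red pixel to roughly 76, reflecting the lower perceived brightness of red.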
As a sub-discipline of artificial intelligence, computer vision facilitates the capacity of machines
to decode and act upon information derived from visual sources like photos, videos, and other
image-based inputs (Sharma et al., 2021).
One particular area of interest is vehicle detection, which involves identifying and tracking
vehicles in real-time using computer vision techniques.
Several vehicle detection approaches have been proposed, including feature-based, appearance-
based, and deep learning-based methods. Feature-based methods rely on handcrafted features such
as edges, corners, and color histograms to detect vehicles, while appearance-based methods use
templates or classifiers trained on vehicle images to detect them (Nigam et al., 2023). On the other
hand, deep learning-based methods use neural networks to learn representations of vehicle features
directly from raw image data (Gupta et al., 2021; Chuangju, 2023).
With the advancements in technology and the increase in traffic on roads, vehicle detection has
become an essential part of road safety. The most current and relevant vehicle detection methods
include cameras, drones, and LiDAR technology. Each method has strengths and limitations, and
the most effective method depends on the specific application and environment.
Cameras are one of the most common methods for vehicle detection. They can be mounted on the
road or vehicles themselves and capture images of the road. Computer vision algorithms are then
used to detect vehicles (Quang Dinh et al., 2020). While this method is effective in clear weather
conditions, it can struggle in heavy rain, snow, or bright sunlight.
Drones are another method for vehicle detection that has gained popularity in recent years. Drones
equipped with cameras can capture images of the road from above and use computer vision
algorithms to detect vehicles (Wang et al., 2022). This method effectively detects vehicles in
different environments, such as parking lots or highways. However, it may not be practical in some
situations, such as the interference of buildings in densely populated urban areas, adverse weather
conditions, or even in areas with limited or no internet connectivity.
LiDAR technology is another method for vehicle detection that has gained popularity in recent
years. It uses laser beams to detect objects and can create a 3D map of the environment (Zhou et
al., 2022; Khan et al., 2022). Like drones, this method is effective in detecting vehicles in different
environments. However, it may be expensive or impractical in some situations.
Radar is another method for detecting vehicles that uses radio waves to detect objects. This method
is effective in all weather conditions but may have limitations in detecting smaller vehicles such
as motorcycles (Li et al., 2022).
Machine learning algorithms are also developed for vehicle detection. These algorithms can learn
to recognize different types of vehicles and adapt to different environments. They are often
combined with cameras, drones, and LiDAR technology to improve vehicle detection accuracy
(Khan et al., 2022). One of the benefits of machine learning algorithms is their ability to adapt to
changing environments. They can learn to recognize different weather conditions, lighting
conditions, and road conditions and adjust their detection methods accordingly. This makes them
particularly effective in areas where conditions may change frequently, such as highways or
construction zones. Another benefit of machine learning algorithms is their ability to detect
partially obscured or difficult-to-see vehicles. For example, they can detect vehicles that are
partially hidden by trees or buildings or that are moving quickly through a crowded area. This can
improve road safety by alerting drivers to potential hazards that they may not have noticed
otherwise. However, there are also some limitations to machine learning algorithms. They require
large amounts of data for training, which can be time-consuming and expensive. Additionally, they
may not be as accurate as other vehicle detection methods in certain situations. For example, they
may struggle to identify vehicles that are very similar in appearance, such as two identical cars
parked next to each other.
Two popular deep learning-based approaches for vehicle detection are the You Only Look Once
(YOLO) algorithm and the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm.
YOLO utilizes a single neural network to perform object detection and classification in real-time,
while the Faster R-CNN algorithm uses a two-stage process: first proposing regions of interest and
then classifying them as either vehicles or non-vehicles (Maity et al., 2021).
Several datasets have been established to develop and evaluate vehicle detection algorithms, which
offer a diverse range of real-world scenarios and challenging conditions to test the strength and
accuracy of vehicle detection algorithms. Despite recent advancements, challenges persist in
improving the accuracy and speed of detection in complex environments and adverse weather
conditions.
Some of the most common challenges include overfitting and underfitting, the limitation of labeled
training data, ineffective utilization of hierarchical features, dependency on field experts for
labeling, and limitations inherent to transfer learning (Alzubaidi et al., 2021). Overfitting, a
phenomenon where the CNN model learns the training dataset so well that it is unable to
generalize to new data, typically occurs when training on small datasets or utilizing fewer layers
in the model (Gallagher, 2022). Its counterpart, underfitting, arises when the model lacks the
complexity to discern the necessary features critical for object detection, thereby impeding
accuracy and generalization (Baheti, 2022). Finding the right balance between the complexity of
the model and the amount of training data is key to avoiding these problems.
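One standard guard that complements this balance is early stopping: training is halted once validation loss stops improving. A minimal sketch follows, where the `patience` value is an illustrative choice rather than a recommended setting.

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop: the first epoch
    at which validation loss has failed to improve for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best; reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for too long
    return len(val_losses) - 1               # trained to the final epoch
```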
The architecture of CNN models, especially those with fewer layers, worsens the
problem because such models cannot properly exploit the hierarchical features present in large
datasets. This leads to poor results in object detection tasks. It is essential, therefore, to
develop CNN models that can make good use of these hierarchical features to boost their
effectiveness significantly.
Lastly, although transfer learning has become a powerful tool to overcome these issues, it still has
downsides. A major problem is the mismatch between the type of data used in transfer learning
and the data in the target dataset. This can reduce the accuracy and effectiveness of
CNN models in detecting objects across various tasks.
As we stand on the cusp of autonomous vehicle development, the necessity for real-time object
detection algorithms that can function in dynamic and intricate settings is paramount. This
demands sophisticated algorithms capable of learning from substantial amounts of meticulously
labeled data and high-performance computing hardware to facilitate real-time processing.
2.2 Deep Learning Techniques for Vehicle Detection
Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of deep learning model in the artificial intelligence family
commonly used for various computer vision tasks, including image classification, object detection,
and image segmentation (Datagen, n.d.). CNNs are specifically designed to process and analyze
visual data, making them highly effective in tasks involving images and visual patterns (Taye,
2023). CNNs are designed explicitly for grid-like data, such as images, where the arrangement of
pixels has spatial significance. A CNN learns patterns in an image by exploiting the dependencies
between neighboring pixels.
Figure 1. Machine Learning Family (Under the Hood of Machine Learning - The Basics, n.d.)
A CNN comprises an input layer, an output layer, and many hidden layers in between. These layers
perform operations that transform the data to learn features specific to it. Some fundamental
concepts in CNNs for object detection (“Fundamental Concepts of Convolutional Neural
Network,” 2019) are:
a. Convolutional Layers: These layers use filters to scan an image and identify specific patterns or
features, such as edges or textures. The convolutional layers slide small filters over the input image,
computing dot products to detect patterns. Each filter specializes in recognizing a specific feature.
b. Activation Function: After convolution, an activation function is applied to the resulting feature
maps, introducing non-linearities that enhance the network's ability to capture complex
relationships. A commonly used activation function is ReLU, which keeps positive values and sets
negative values to zero.
c. Pooling Layers: Pooling layers reduce the spatial dimensions of the features by summarizing
the information in small regions while preserving the essential features.
d. Fully Connected Layer: The fully connected layer is typically found at the end of a CNN and is
responsible for classification. It takes input from the previous layers and generates the final output,
assigning probabilities to different object classes.
e. Region-based CNN (R-CNN): R-CNN is one of the first CNN models designed specifically for
object detection. Rather than exhaustively sliding a window over the image, it classifies a set of
category-independent region proposals. The architecture consists of three modules: region
proposal extraction, affine image warping, and category-specific classification using SVMs.
f. Mask R-CNN: Mask R-CNN is an extension of Faster R-CNN that also predicts pixel-level
segmentation masks for objects. It uses a feature pyramid network (FPN) and a bottom-up pathway
to improve the extraction of low-level features. The architecture includes three branches for
bounding box prediction, object class prediction, and mask generation.
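The core layer operations above (convolution, ReLU activation, and pooling) can be sketched in plain NumPy. This is a simplified single-channel illustration with a hand-picked edge filter, not a trained CNN:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a filter over the image, computing dot products (valid padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Keep positive values, set negative values to zero."""
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Summarize each size x size region by its maximum, shrinking spatial dimensions."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # a simple vertical-edge detector
features = max_pool(relu(conv2d(image, edge_filter)))
print(features.shape)  # (2, 2)
```

Stacking many such filtered, activated, and pooled maps is what lets deeper layers build complex features out of simple ones.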
These fundamental concepts form the building blocks of CNN models for object detection. They
enable the network to extract relevant features, classify objects, and generate accurate predictions.
During training, CNNs adjust their parameters (weights and biases) to minimize the difference
between the predicted class probabilities and the actual labels. This optimization process, called
backpropagation, iteratively updates the parameters to improve the network's accuracy.
In object detection models, CNN layers are utilized to extract features from input images, where
earlier layers detect simple patterns and deeper layers capture more complex features (Szegedy et
al., 2015). Incorporating CNNs into detection models has enabled real-time object detection with
architectures like YOLO (You Only Look Once) that process an entire image with a single CNN
(Redmon et al., 2016).
YOLO
YOLO is a real-time object detection algorithm introduced in 2015. It frames detection as a
regression problem rather than a classification pipeline: a single convolutional neural network
predicts spatially separated bounding boxes and associates class probabilities with
each detected object (Jocher et al., 2022). The architecture of YOLO involves resizing the input
image, applying convolutions, and using techniques like batch normalization and dropout to
improve performance. YOLO has evolved over the years, with versions from v1 to v8 addressing
limitations such as detecting smaller objects and unusual shapes. It has gained popularity due to
its accuracy, generalization capabilities, and being open source. In this research, we have decided
to proceed with YOLO v5 and YOLOv8.
The Starting Point
The first version of YOLO, released in 2015, was a game-changer for object detection. However,
it had limitations, such as struggling to detect small objects that appear in groups and difficulty
generalizing to objects with new or unusual aspect ratios.
Incremental Improvement
YOLOv2, also known as YOLO9000, introduced a new backbone architecture called Darknet-19,
which improved the accuracy and speed of object detection. It also used logistic regression for
better bounding box prediction and introduced independent logistic classifiers for more accurate
class predictions.
Further Enhancements
YOLOv3 built upon YOLOv2 by performing predictions at three different scales, allowing for
better semantic information and higher-quality output. It achieved a strong balance of speed and
accuracy compared to previous versions and other state-of-the-art object detectors.
Optimal Speed and Accuracy
YOLOv4 outperformed YOLOv3 in both speed and accuracy, making it the most advanced version
of YOLO at the time of its release and achieving an optimal speed-accuracy trade-off compared to
previous versions and other state-of-the-art object detectors.
YOLOv5, developed by Ultralytics, moved the framework to a native PyTorch implementation
and focused on ease of training and deployment, achieving strong accuracy and speed on the
COCO dataset.
Hardware-Friendly Design
YOLOv6, developed by Meituan, a Chinese e-commerce company, targeted industrial
applications. It introduced a hardware-friendly backbone and neck design and an efficient
decoupled head, further enhancing accuracy and speed.
Trainable Bag-of-Freebies
YOLOv7 aimed to improve detection accuracy without increasing training costs. It focused on
increasing both inference speed and detection accuracy, making it a significant improvement over
previous versions.
The Pinnacle of Efficiency and Accuracy
Using the improvements made in YOLOv7 as a foundation, YOLOv8 aims to be the most capable
version yet at detecting objects in images and video, pushing speed and accuracy further. A major
advantage is that it works well alongside all the older YOLO versions, making it simple for users
to try out different versions and see which one works best, a top option for those who want the
newest YOLO tools while still using their older ones (Spodarets & Rodriguez, 2023).
Figure 2. YOLO evolution (A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond, 2023)
Faster R-CNN
Faster R-CNN was introduced by Ren et al. (2017) in their paper “Faster R-CNN: Towards Real-
Time Object Detection with Region Proposal Networks”. It was developed to solve the computational
bottleneck in object detection systems caused by region proposal algorithms. Faster R-CNN is an
object detection algorithm that is widely used in computer vision tasks. It consists of two modules:
a deep, fully convolutional network that proposes regions and a Fast R-CNN detector that uses the
proposed regions. The Region Proposal Network (RPN) is a fully convolutional network that
predicts object bounds and objectness scores at each position. The RPN is trained to generate high-
quality region proposals, which are then used by the Fast R-CNN detector for object detection.
The RPN in Faster R-CNN enables nearly cost-free region proposals by sharing convolutional
features with the detection network. This reduces the computational bottleneck of region proposal
computation and allows the system to run at near real-time frame rates. At a more detailed level,
the backbone network, which in this case is ResNet, is used to extract features from the image.
These features are then fed into a Region Proposal Network (RPN) that generates region proposals,
potentially bounding box locations for objects (Yelisetty, 2020). Finally, a Box Head is used to
refine and tighten the bounding boxes.
Features of Faster R-CNN are the Region Proposal Network (RPN), unified network, efficient
region proposals, and improved localization accuracy. The RPN in Faster R-CNN significantly
improves the localization accuracy of object detection, especially at higher Intersection over Union
(IoU) thresholds. It achieves higher mean Average Precision (mAP) scores compared to the Fast
R-CNN system.
Figure 3. Faster R-CNN architecture (KHAZRI, n.d.)
Detectron2
Detectron2 is an object detection library developed by the Facebook AI Research (FAIR) team. It is
a complete rewrite of the original Detectron, which was released in 2018. The library is built on
PyTorch and offers a modular and flexible design, allowing for fast training on single or multiple
GPU servers (Wu, 2019). Detectron2 supports a wide range of object detection models. These
models are pre-trained on the COCO Dataset and can be fine-tuned with custom datasets. Some of
the detection models available in the Detectron2 model zoo include Fast R-CNN, Faster R-CNN,
and Mask R-CNN. These models are designed to detect and classify objects in images or videos.
Detectron2 also offers features like network quantization and model conversion for optimized
deployment on cloud and mobile platforms (Yelisetty, 2020). With its speed, scalability, and
extensive model support, Detectron2 is a powerful tool for training object detection models. It
aims to advance machine learning by providing speedy training and addressing the challenges
faced when transitioning from research to production.
Detectron2 leverages the power of Faster R-CNN by providing a modular and extensible design.
This means that users can easily plug in custom module implementations into different parts of the
object detection system (Wu, 2019). This allows for easy experimentation and customization of
the models.
Important Object Detection Notions
YOLO Model Sizes
The YOLO (You Only Look Once) object detection model comes in different sizes for various use
cases. YOLOv5, for example, offers several model sizes, typically referred to as small, medium,
large, and extra-large, allowing users to choose the one that suits their specific requirements. The
size of the model influences its speed and accuracy: smaller models are faster but may not be as
accurate, while larger models offer better accuracy but may be slower (Jocher, 2021).
Loss Function
In the YOLO object detection system, the loss function is a composite function that measures the
differences between the predicted and true values for objectness, class predictions, and
bounding box coordinates. The loss function is designed to weigh errors in bounding box
coordinates and object detection more heavily than errors in class predictions. Generally, it
integrates mean squared error for bounding box predictions and logistic regression loss for class
predictions and objectness (Redmon et al., 2016).
In Faster R-CNN, the loss function, often referred to as multi-task loss, is a unified objective
function that merges the classification loss (usually log loss) and the bounding box regression loss
(usually smooth L1 loss). This approach allows the model to learn to classify objects and predict
bounding box coordinates concurrently, facilitating more accurate object detection (Ren et al.,
2017).
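The two loss styles described above can be sketched as follows. These are generic textbook formulations of smooth L1 (bounding-box regression) and binary log loss (classification/objectness), not the exact implementations used by YOLO or Faster R-CNN:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).sum()

def log_loss(p, y, eps=1e-12):
    """Binary log loss between predicted probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# A predicted box vs. a ground-truth box (x, y, w, h): small errors stay
# quadratic, while the large height error falls in the linear regime.
pred_box = np.array([0.5, 0.5, 2.0, 2.0])
target_box = np.array([0.6, 0.4, 2.0, 1.0])
print(smooth_l1(pred_box, target_box))  # ≈ 0.51
```

The quadratic region keeps gradients small near the target, while the linear region prevents outliers from dominating the total loss.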
ReLU
ReLU is an activation function defined as f(x) = max(0, x). It outputs the
input directly if the input is positive; otherwise, it outputs zero. It has become popular for its
computational efficiency and its ability to mitigate the vanishing gradient problem to some extent
(ReLu Definition, n.d.).
He Weight Initialization for ReLU
He Weight Initialization is a weight initialization technique specifically designed for ReLU
activation functions. It addresses the issue of poor convergence in deep networks that can occur
with other initialization methods (He et al., 2015). He et al. (2015) proposed this technique, which
initializes the weights of ReLU layers with a variance of Var(W) = 2/n_in, where n_in is the number
of inputs to the layer. Adopting this approach keeps the input and output gradient variances
consistent, thereby enhancing network convergence and operational efficacy. Typically,
prior to initiating network training, the ReLU layer weights are adjusted in alignment with this
variance formula. By initializing the weights in this way, the network is better able to learn and
optimize the ReLU activation function, leading to improved performance (Aguirre & Fuentes,
2019). Since in this technique the weights are set to numbers that are not too small or too big, it
helps to prevent the vanishing gradient problem. It ensures that the variances in the activations of
hidden units do not increase with depth, leading to better convergence in deep networks (He et al.,
2015). YOLO and Faster R-CNN, which are utilized in this thesis, are prone to issues related to
vanishing or exploding gradients, and both often employ ReLU (Rectified Linear Unit) activations
in their architectures. Proper weight initialization, such as He initialization, can keep gradients
in a manageable range, leading to faster convergence during training.
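A minimal sketch of He initialization for a dense layer with n_in inputs; the sampled variance should sit close to the target 2/n_in:

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """He initialization: draw weights with variance 2 / n_in, suited to ReLU layers."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 256)
print(W.var())  # close to 2 / 512 ≈ 0.0039
```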
Vanishing gradient problem
The vanishing gradient problem refers to the phenomenon where gradients of the loss function
become too small for the network to learn effectively, a problem often encountered in deep neural
networks with many layers. This occurs because the derivatives of the activation functions used in
the network become very small, causing the gradients to diminish as they are backpropagated
through the layers. This problem can impede the convergence of the network during training and
is one of the motivations for the development of alternative activation functions like ReLU (What
Is the Vanishing Gradient Problem? n.d.).
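A toy numerical illustration of why this happens: the derivative of the sigmoid is at most 0.25, so multiplying it across many layers during backpropagation drives the gradient toward zero, whereas ReLU's unit derivative for positive inputs avoids this shrinkage:

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Gradient magnitude after backpropagating through 20 sigmoid layers
# (activations near zero, the best case for the sigmoid).
grad = 1.0
for _ in range(20):
    grad *= sigmoid_grad(0.0)  # multiplies in 0.25 per layer
print(grad)  # 0.25**20 ≈ 9.1e-13, effectively vanished

# ReLU's derivative is 1 for positive inputs, so the product does not shrink.
relu_grad = 1.0 ** 20
print(relu_grad)  # 1.0
```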
Spline Wavelet Transforms
Wavelet transforms are mathematical operations used for signal analysis and data compression.
They can be used for object detection by analyzing the frequency content of an image or signal.
The wavelet transform decomposes the image or signals into different frequency components,
allowing for the identification of specific patterns or features (Wavelet Transforms — GSL 2.7
Documentation, n.d.).
Wavelet transform is commonly used in deep learning architectures, specifically in convolutional
neural networks (CNNs). It can be integrated into CNN architectures to improve noise robustness
and classification accuracy. By replacing traditional down-sampling operations with wavelet
transform, the low-frequency component of the feature maps, which contains important object
structures, is preserved while the high-frequency components, which often contain noise, are
dropped (Li et al., 2020).
Based on the documentation of wavelet transforms (Wavelet Transforms — GSL 2.7
Documentation, n.d.), there are two approaches to how it can be used for object detection. One
approach is to apply the wavelet transform to the image or signal and then analyze the resulting
coefficients. By selecting the largest coefficients, which represent the most significant frequency
components, objects or features of interest can be identified. The remaining coefficients can be set
to zero, effectively compressing the data while retaining the important information.
Another approach is to use the wavelet transform as a feature extraction technique. By applying
the wavelet transform to different regions of an image or signal, features such as edges, textures,
or shapes can be extracted. These features can then be used for object detection or classification.
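A minimal one-level 2-D Haar decomposition in NumPy illustrates the idea. This hand-rolled version, built from simple pairwise averaging and differencing, is a stand-in for library routines (e.g., in GSL or PyWavelets):

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar wavelet transform via pairwise averaging/differencing."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages (low frequency)
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences (high frequency)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0      # low-low: coarse object structure
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0      # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0      # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal detail, often mostly noise
    return LL, LH, HL, HH

img = np.random.default_rng(0).random((8, 8))
LL, LH, HL, HH = haar_dwt2(img)
print(LL.shape)  # (4, 4), a half-resolution summary of the image
```

Replacing a down-sampling step with LL keeps the coarse structure, while dropping HH discards much of the high-frequency noise, which is the substitution described above.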
Heatmaps and Bounding Boxes
Some important notions that are used throughout this research are heatmaps and bounding boxes.
In object detection, combining heatmaps and bounding boxes often involves utilizing heatmaps to
identify the central points or key points of objects and then using additional geometric attributes
to construct bounding boxes around identified objects. Using heatmaps, the CNN can accurately
and efficiently locate and track objects in an image, improving its performance over time by having
clear indicators of the objects' positions (Haq et al., 2022).
A neural network is used to generate a heatmap where the intensity of each pixel represents the
likelihood of that pixel being a central point or a key point of an object. Heatmaps can represent
additional attributes like object width, length, or even orientation. Then, a peak detection algorithm
is used to detect peaks in the heatmaps indicating an object's center. In this research, the technique
of thresholding is used for such detections. It involves selecting a threshold value, and then setting
all pixel values less than the threshold to zero and all pixel values greater than or equal to the
threshold to one (or some other specified value). This essentially divides the image into regions of
interest based on the intensity values of the pixels. From the detected central points, the network
predicts the other attributes that are needed to create the bounding boxes.
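The thresholding step described above can be sketched as follows (illustrative values, not the actual experimental data):

```python
import numpy as np

def threshold_peaks(heatmap, thresh=0.5):
    """Binarize a heatmap: pixels >= thresh become 1, all others 0."""
    return (heatmap >= thresh).astype(np.uint8)

heatmap = np.array([[0.1, 0.2, 0.1],
                    [0.2, 0.9, 0.3],
                    [0.1, 0.3, 0.1]])
mask = threshold_peaks(heatmap, 0.5)
centers = np.argwhere(mask)  # coordinates of candidate object centers
print(centers)  # [[1 1]]
```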
Haq et al. utilize the heatmap technique for object detection and tracking. The authors use a fully
convolutional neural network to generate heatmaps representing each image's desired
output. These heatmaps are created by placing a Gaussian distribution around the object's location
coordinates obtained from a text file. The purpose of these heatmaps is to provide feedback to the
CNN and improve its performance by indicating how well it is currently detecting and tracking
the object.
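A sketch of building such a Gaussian target heatmap. The function below illustrates the general technique of placing a Gaussian around the object's location; it is not Haq et al.'s actual code:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Target heatmap with a Gaussian bump centred on the object's (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 64, cx=20, cy=30)
print(hm.max(), np.unravel_index(hm.argmax(), hm.shape))  # 1.0 at (30, 20)
```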
Combining heatmaps and bounding boxes in this way allows for a more nuanced approach to
object detection, leveraging both the localization information provided by heatmaps and the
classificatory power of bounding boxes to achieve accurate object detection (Xu et al., n.d.).
Xu et al. propose another method that combines heatmaps and bounding boxes to improve
detection performance in videos. Their method, called heatmap propagation, generates
a propagation heatmap by overlapping the heatmaps of detected objects from consecutive frames.
This propagation heatmap highlights the potential positions of each object's center with its
confidence score. Then, the paper combines the network output heatmap of the current frame with
the propagation heatmap to create a long-term heatmap. This long-term heatmap serves as a
temporal average and helps to compensate for low image quality and maintain stable detection.
The paper calculates a weighted size for bounding box prediction by combining the propagated
size information and the network output size. This weighted size is used to generate the final
bounding box prediction.
2.3 Aerial Image Detection Challenges
Aerial images capture the Earth’s surface from above, at various elevation levels and steep
viewing angles. They are typically obtained from aircraft, drones, satellites, or other
airborne platforms. Due to the level of detail they provide over a large spatial scale, they have
found use in many scientific fields. Integrating AI technologies like deep learning and computer
vision with aerial imagery analysis has opened a new avenue in academic research, facilitating
more detailed and sophisticated analyses and predictions.
As with any other new technology, aerial images come with their own challenges, which are
subject to research. Identified challenges include arbitrary orientations, scale variations,
nonuniform object densities, and large aspect ratios (Ding et al., 2022). Capturing objects from
different angles and orientations due to the overhead view makes object detection more
challenging. Due to the variations in altitude, aerial images can vary significantly in size, which
brings the need for handling scale variations. Varying densities of objects across regions add to
the complexity of object detection, as do objects with large aspect ratios, such as the long,
narrow shapes of ships or vehicles.
Another set of challenges that must be overcome to ensure accuracy and reliability concerns
obtaining high-resolution imagery. Harsh weather conditions or buildings can obscure parts of the
image, and lighting and shadows can make it difficult to distinguish objects from their
surroundings (Lu et al., 2023; Liu et al., 2023). Another challenge is the need
for real-time object detection, which requires fast and efficient algorithms (He et al., 2019). This
can be especially challenging when dealing with large datasets or complex environments. To
overcome these challenges, researchers are developing new algorithms and techniques that are
specifically designed for object detection in aerial images. These include deep learning algorithms,
feature extraction techniques, and advanced image processing methods.
Ding et al. address these challenges and limitations of object detection in aerial images. The paper
focuses on the lack of large-scale benchmarks for object detection in aerial images and presents a
comprehensive dataset called DOTA (Dataset of Object Detection in Aerial images), which
contains over 1.7 million instances annotated by oriented bounding boxes (OBBs). The paper also
provides baselines for evaluating different algorithms and a code library for object detection in
aerial images. The research aims to facilitate the development of robust algorithms and
reproducible research in this field. The method used in this paper is a combination of CNN OBB
and RoI (region of interest) Transformer. It evaluates ten algorithms, including Faster R-CNN and
over 70 models with different configurations using the DOTA-v2.0 dataset. It achieves an OBB
mAP (mean average precision) of 73.76 for the DOTA-v1.0 dataset. The method outperforms
previous state-of-the-art methods, except for the one proposed by Li et al., 2022. It also
incorporates rotation data augmentation during training and shows significant improvement in
densely packed small instances.
2.4 Real-Time Tracking and Detection on Edge Devices
"Edge devices" are hardware units that manage the transmission of data at the junction of two
networks, acting as gateways that moderate the flow of data (Posey & Scarpati, n.d.). These
devices, which range from routers to multiplexers and integrated access devices, function to
streamline the communication between distinct networks. Training a CNN on edge devices refers
to using these devices to perform the complex tasks needed to train a CNN model. In the past, deep
learning models such as CNNs were typically trained in large data centers that had powerful GPUs.
However, due to the increasing need for immediate, on-device processing and the quick
advancements in hardware technology, there has been a growing preference to train these models
directly on edge devices (Ran, 2019).
Edge devices are important for allowing quick, real-time processing in object detection because
they are placed at key points where different networks meet. They handle data and do computing
jobs near where the data is coming from (like IoT devices) instead of using big, central data centers
(What Is Edge Computing, Architecture Difference and How Does Edge Computing Work? 2023).
The emergence of edge computing has brought about several advantages that are critical for the
applications discussed here, as described by Liang et al. (2022) and Amanatidis et al. (2023).
One of the key advantages of edge computing is its ability to minimize latency (Exploring the
Potential of Edge Computing: The Future of Data Processing and IoT, 2023). With edge
computing, data no longer needs to travel to a distant server for processing, which is particularly
useful for time-sensitive applications such as autonomous driving or real-time video analysis. This
is because the processing takes place closer to the data source, leading to faster response times and
greater efficiency.
Another significant advantage of edge computing is its ability to enhance privacy. Edge computing
can benefit applications dealing with sensitive data by keeping data on the device where it was
generated. This includes applications such as healthcare and finance, where data privacy is of
utmost importance. By ensuring that data remains on the device, edge computing can effectively
reduce the risk of data breaches and unauthorized access, which may occur during data
transmission.
Edge computing can also help reduce network load. With the processing taking place on the device
itself, only relevant data needs to be transmitted, thereby reducing the amount of data that needs
to be sent over the network. This can lead to more efficient use of network resources, which can
be particularly useful in situations where network capacity is limited.
Finally, edge computing can enhance operability. Since the system can continue to operate even
when network connectivity is lost or unreliable, applications can continue to function even in
difficult network conditions. This can be critical for applications that require constant uptime, such
as those used in industrial settings.
However, training deep learning models on edge devices also comes with challenges. These
devices often have limited computational resources and power, so models must be efficient and
lightweight. This has sparked interest in areas such as model compression, network pruning, and
efficient neural architecture search (Falk, 2022).
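As an illustration of one such technique, the sketch below applies magnitude-based weight pruning, zeroing out the smallest weights to produce a sparser, lighter model. This is a simplified example of the idea, not a production compression pipeline:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights, keeping a (1 - sparsity) fraction."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)          # number of weights to remove
    if k == 0:
        return weights.copy()
    cutoff = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= cutoff] = 0.0
    return pruned

W = np.random.default_rng(0).normal(size=(4, 4))
Wp = prune_by_magnitude(W, sparsity=0.5)
print((Wp == 0).mean())  # 0.5
```

Sparse weight matrices can then be stored and multiplied more cheaply, which is exactly what resource-constrained edge hardware needs.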
2.5 Evaluation Metrics and Benchmarking
Evaluation metrics and benchmarking in object detection are processes that work hand in hand to
assess and compare the performance of various object detection algorithms or models.
Evaluation metrics provide the quantitative foundation upon which benchmarking is carried out.
These metrics, which include parameters such as precision, recall, and mean average precision
(mAP), offer a detailed insight into the individual performance of a model in terms of its accuracy,
ability to detect objects under varying conditions, and efficiency in distinguishing between
different objects within a given dataset. By assessing models based on these metrics, researchers
and developers can understand the strengths and weaknesses of a model in a quantitative manner.
When assessing the effectiveness of various models, benchmarking is the preferred technique. This
methodology considers the evaluation metrics and entails a comparative examination of multiple
models to ascertain which one performs optimally in specific circumstances. Through
benchmarking, one can pinpoint the most appropriate model for a given task and fine-tune its
performance accordingly. This process helps in identifying the best-performing models based on
real-world testing scenarios and datasets and aids in fine-tuning existing models or developing
new models that are more efficient and accurate.
However, as Padilla et al. (2021) highlight in their paper, there are challenges with the benchmark
and evaluation metrics, which vary significantly across different studies and can sometimes result
in confusing and potentially misleading comparisons. The identified challenges are differences in
bounding box representation formats and variations in performance assessment tools and metrics.
They further support this argument with the varying formats that different object detectors may
use to represent bounding boxes, such as absolute coordinates or relative coordinates
normalized by the image size. This is important because, later in the methodology of this research,
bounding boxes will be converted to fit the YOLO and Detectron frameworks.
These variations in representation formats can make it difficult to compare and evaluate the
performance of different detectors.
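As an illustration of such a conversion, the sketch below maps absolute corner coordinates (x_min, y_min, x_max, y_max) to YOLO's normalized centre/width/height format. This is a generic formulation of the common conventions, not any specific tool's code:

```python
def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Absolute corner box -> YOLO format: centre and size normalized by image dims."""
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return x_c, y_c, w, h

# A 200x200 box in a 1000x800 image becomes fractions of the image size.
print(corners_to_yolo(100, 200, 300, 400, img_w=1000, img_h=800))
# (0.2, 0.375, 0.2, 0.25)
```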
Regarding the variations in performance assessment tools and metrics, Padilla et al. (2021) state
that each performance assessment tool implements a different set of metrics, requiring specific formats
for the ground-truth and detected bounding boxes. While there are tools available to convert
annotated boxes from one format to another, the lack of a tool compatible with different bounding
box formats and multiple metrics can lead to confusion and misleading comparative assessments.
To overcome the challenges in the benchmarking of datasets and evaluation metrics, the authors
have proposed the development and release of an open-source toolkit for object detection metrics
(Padilla et al., 2021). The toolkit includes various evaluation metrics, supports different bounding
box formats, and introduces a novel metric for video object detection. The metric considers the
motion speed of the objects and measures the average Intersection over Union (IOU) score between
the ground-truth objects and the detected objects in the current frame and nearby frames. This
metric provides a more comprehensive evaluation of video object detection performance compared
to traditional metrics like mean Average Precision (mAP).
Some of the evaluation metrics considered in their study are average precision, intersection over
union, mean average precision and average recall.
Definitions of the evaluation metrics in deep learning are:
● Average Precision (AP) measures the precision of object detection at different levels of
recall. It calculates the area under the precision-recall curve and provides an overall
measure of detection accuracy (Anwar, 2022). To compute the area under the Precision-
Recall curve, the curve is divided into smaller segments, typically trapezoids. The area of
each segment, which forms a trapezoid between two adjacent points, is calculated. The
total area under the curve is obtained by summing the areas of all segments. Finally, to
derive the Average Precision, the total area is divided by the total range of recall. This
normalization ensures that the metric is not dependent on the specific recall range of the
dataset.
Figure 4. Illustration of precision-recall curve (Phung,2020)
The resulting value provides an average assessment of how well the system performs across
various thresholds. It offers a comprehensive measure of the system's efficiency in ranking
relevant items (Precision-Recall — Scikit-Learn 1.3.1 Documentation, 2011).
Precision is the proportion of correctly predicted positive instances out of all
instances predicted as positive. The formula for precision is:

Precision = TP / (TP + FP)

On the other hand, recall measures the proportion of correctly predicted positive
instances out of all actual positive instances. The formula for recall is:

Recall = TP / (TP + FN)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
The selection of Average Precision (AP) as the assessment metric in this thesis stems from its
widespread adoption and efficacy in appraising object detection performance. AP offers a
comprehensive evaluation of the model's capability to detect objects under varying conditions by
considering performance at different detection thresholds (Padilla et al., 2020). Furthermore, AP
grants recognition for situations in which a model identifies only a portion of an object. It merges
precision (the correctness of positive predictions) and recall (the proficiency to locate all pertinent
instances) into one metric. This holistic assessment proves vital in object detection scenarios where
both incorrect positive identifications and overlooked instances hold substantial importance.
Effective object detection mandates not only accurate categorization but also exact localization of
objects. Average Precision (AP) takes both factors into consideration, making sure that the model's
proficiency in accurately placing bounding boxes is evaluated (Garcia-Garcia et al., 2017).
● Intersection over Union (IOU) measures the overlap between the predicted bounding box
and the ground-truth bounding box (Anwar, 2022; Shah, 2022). It indicates how well the
predicted bounding box lines up with the actual position of the object. It is
used to determine whether a detection is correct based on a predefined threshold. The
threshold is like a benchmark or standard used to decide whether the prediction was good
enough. If the IoU value meets or goes above this threshold, then we say it is a "true
positive," meaning it was a correct prediction. However, if it does not meet this benchmark,
then it's seen as a "false positive," a wrong prediction. The choice of threshold depends on
the specific task and can vary based on the model's accuracy expectations. In object
detection tasks, the IoU threshold also determines the precision, i.e., how many correct
positive predictions (true positives) were obtained out of all positive predictions made
(true positives plus false positives). By changing the IoU threshold, researchers can
control how strict the evaluation is: a higher threshold is stricter about localization
accuracy and yields lower precision, while a lower threshold is more generous in
accepting predictions and yields higher precision. This allows the detection evaluation
to be made as strict or as lenient as required.
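The IoU computation described above can be sketched as a short Python function (an illustrative implementation, not the thesis's code), here using axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection counts as a true positive only if IoU meets the threshold:
pred, truth = (10, 10, 50, 50), (20, 20, 60, 60)
print(iou(pred, truth))          # ~0.39
print(iou(pred, truth) >= 0.5)   # False: rejected at a 0.5 threshold
```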
Figure 5. Intersection over Union (Anwar, 2022)
● To compute the mAP, one considers several factors, including the confusion matrix — a
table that provides insight into the algorithm's performance — as well as IoU, which gauges
the algorithm's accuracy in object detection. Additionally, recall (the proportion of actual
positives correctly identified) and precision (the proportion of correct positive
identifications out of all positive identifications) are considered.
The mAP score not only examines the balance between precision and recall, seeking an
optimal equilibrium, but also addresses errors like false positives (incorrectly identifying
something as an object) and false negatives (failing to detect an actual object).
By accounting for all these facets, the mAP offers a comprehensive evaluation of the
algorithm's performance. This is why it is widely embraced by researchers in the field of
computer vision, serving as a reliable benchmark to gauge the robustness and dependability
of various object detection models.
According to Shah (2022), these are the steps to calculate mAP:
a. Generate prediction scores using the model.
b. Convert the prediction scores to class labels.
c. Calculate the confusion matrix, which includes true positives (TP), false positives (FP),
true negatives (TN), and false negatives (FN).
d. Calculate precision and recall metrics.
e. Calculate the area under the precision-recall curve.
f. Measure the average precision for each class.
g. Calculate the mAP by finding the average precision for each class and then averaging
over the number of classes.
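The per-class part of these steps (a through f) can be sketched as follows. This is a simplified illustration with hypothetical detections already labeled as true or false positives; a real evaluation would first match detections to ground truth via IoU:

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP for one class: rank detections by confidence, accumulate TP/FP,
    and take the area under the precision-recall curve."""
    order = np.argsort(scores)[::-1]            # steps a/b: rank predictions
    tp = np.cumsum(np.array(is_tp)[order])      # step c: running true positives
    fp = np.cumsum(~np.array(is_tp)[order])     # ...and false positives
    prec = tp / (tp + fp)                       # step d: precision at each rank
    rec = tp / n_gt                             # ...and recall
    # step e: make precision monotonically decreasing, then integrate
    prec = np.maximum.accumulate(prec[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], rec))) * prec))

# Step g: mAP is the mean of the per-class AP values.
ap_car = average_precision([0.9, 0.8, 0.6], [True, False, True], n_gt=2)
print(ap_car)  # 0.8333...
```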
Figure 6. mAP flowchart calculation (Hofesmann, 2020)
Figure 7. Steps how to calculate mAP (Anwar, 2022)
● The confusion matrix is a tool used to evaluate the performance of object detection models.
It consists of four attributes: True Positives (TP), True Negatives (TN), False Positives
(FP), and False Negatives (FN). TP represents correct predictions, TN represents correct
non-predictions, FP represents incorrect predictions, and FN represents missed predictions.
These attributes help assess the accuracy and reliability of the model's predictions.
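In object detection, these counts are typically obtained by matching predictions to ground-truth boxes via IoU. Below is a simplified, hypothetical greedy matching scheme (not the exact protocol of any particular benchmark):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def detection_counts(preds, truths, iou_threshold=0.5):
    """Tally TP, FP, FN by greedily matching each prediction to the best
    unmatched ground-truth box whose IoU clears the threshold."""
    matched, tp, fp = set(), 0, 0
    for p in preds:
        best, best_iou = None, 0.0
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) > best_iou:
                best, best_iou = i, iou(p, t)
        if best is not None and best_iou >= iou_threshold:
            matched.add(best)   # correct detection
            tp += 1
        else:
            fp += 1             # detection with no matching ground truth
    fn = len(truths) - len(matched)  # ground truths never matched: missed cars
    return tp, fp, fn

print(detection_counts([(0, 0, 10, 10), (100, 100, 110, 110)],
                       [(1, 1, 11, 11), (50, 50, 60, 60)]))  # (1, 1, 1)
```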
Figure 8. Confusion Matrix (Shah, 2022)
● Average Recall (AR) summarizes the recall of object detection across a range of IoU
thresholds. It is computed as the area under the recall-IoU curve, focusing on recall
rather than precision.
2.6 Relevant Studies, Research Gaps and Future Directions
Numerous research initiatives have been carried out focusing on vehicle detection by applying
deep learning methods. These studies often incorporate aspects such as utilizing aerial images,
processing data in real-time, edge computing, and accounting for varying weather scenarios, while
also aiming to enhance existing algorithms with regards to speed and precision. However, this field
faces hurdles owing to the insufficient exploration of object detection under snowy settings. Based
on an extensive literature review, below are two recent studies that are most closely related to the
investigation that our research is aiming for.
1. Li et al., 2023, aim to develop a deep learning-based target detection algorithm for aerial
vehicle identification. It focuses on the usage of small, low-altitude UAVs for aerial vehicle
detection. It addresses the challenge of precise and real-time vehicle detection in UAVs,
emphasizing that most vehicle targets have few feature points and small sizes, making
detection difficult. The study customizes the YOLOv5 model by introducing additional
prediction heads to detect smaller-scale objects and incorporates a Bidirectional Feature
Pyramid Network. As a prediction frame filtering method, they employ Soft non-maximum
suppression. The experiments are conducted on a self-made dataset to evaluate the
performance of the customized YOLOv5-VTO, which they compare with the baseline
YOLOv5s in terms of mean average precision at different IoU thresholds. The results
show improvements in mAP@0.5 and mAP@0.5:0.95, as well as
accuracy and recall. This study does not include the element of weather conditions and
edge-device processing.
2. Bulut et al., 2023, bring several new contributions to the research on object detection
models for traffic safety applications on edge devices. This study evaluates the
performance of the latest models, including YOLOv5-Nano, YOLOX-Nano, YOLOX-
Tiny, YOLOv6-Nano, YOLOv6-Tiny, and YOLOv7-Tiny on an embedded system-on-
module device that is known for performing edge computing tasks. The study does not
document the parameters of the edge device but mentions the key metrics for performance
comparison such as precision, power consumption, memory usage and inference time. The
study uses a dataset consisting of 11,129 images taken from highway surveillance cameras.
The dataset was shuffled to prevent models from memorizing sequential scenes. Average
precision (AP) was used as the key metric for performance evaluation, which measures
how many boxes are detected correctly out of the total number of detected boxes for a
given object. The models were tested with three consecutive runs for each metric, and the
final value was calculated using the average of the runs. The experimental results showed
that the YOLOv5-Nano and YOLOv6-Nano models were the strongest candidates for real-
time applicability, with the lowest energy consumption and fewer parameters. The
YOLOv6-Tiny model achieved the highest precision values, making it suitable for
applications where accuracy is critical. Again, this study is missing the element of weather
conditions, which our study aims to include.
Despite the undeniable improvements and contributions in the capabilities of deep learning in
object detection, there is still room for further optimization. To achieve this, future research could
explore novel optimization techniques and fine-tune the training process to boost the performance
of models. Additionally, the use of different hardware and software configurations should be
evaluated to assess their impact on model applicability in real-life scenarios. Another area that
requires further investigation is the performance of different models or variations of existing
models. Researchers should aim to identify potential improvements or alternatives that could
enhance the effectiveness of these models in real-life applications. Furthermore, addressing
resource efficiency challenges in object detection models is crucial. This could involve developing
lightweight architectures or optimizing existing models to reduce computational power
requirements while maintaining accuracy and real-time applicability. Finally, there is the clear
effect of weather conditions, especially snow, on the quality of images and, consequently, on how
well detection models work. Current research struggles with a noticeable lack of data on snowy conditions,
showing a strong need to include weather factors in future studies. Adding this information would
not only widen the research area but also be a step towards creating a model that remains strong
and unaffected by the challenges of bad weather.
CHAPTER 3
Methodology
3. Thesis Methodology Approach
3.1 Dataset Acquisition and Description
As this thesis builds upon Mokayed et al., 2023, the data capture and preparation are well
documented in their paper. Here is a summary of the process. The dataset on which all models are
trained is called the Nordic Vehicle Dataset (NVD). The dataset consists of videos/images captured
from a UAV, containing vehicles in random areas and with different snow coverage. It was captured
using a "Smart planes Freya unmanned aircraft system" drone.
The dataset is of a heterogeneous nature, including different ranges of snow coverage, disparities
in resolution, and aerial views from varying altitudes ranging between 120 to 250 meters. The
videos, captured at a speed of 13m/s (47 km/h), are characterized by different features, including
varied resolutions and stabilization statuses; not all are annotated and stabilized yet. The Smart
planes Freya drone facilitated the data collection. Its technical features, which include a wingspan
of 120 cm and a cruise speed of 13m/s, were essential in the acquisition of high-quality data. The
UAV’s camera specifications are equally noteworthy, having a sensor of 1.0-type (13.2 x 8.8 mm)
Eximor RS CMOS and capable of recording videos at 1080p or 4K at 25 frames per second,
lending depth and clarity to the recorded values.
Figure 9. Showcase of dataset images
Despite the diversity in the data, a selection was made that focused on a process to narrow down
the usable data to stabilized and annotated videos that conform to a 1920x1080 resolution and
with snow coverage ranging between 0-10 cm, as described in detail in Table 3. The selection
process was attentively done, leading to the selection of 5 videos specifically for the purposes of
this thesis, out of the available 22, which were further divided into training and testing sets and all
selected models were trained on the same split. This methodological decision was made to ensure
a fair comparison of the models, eliminating any potential influence from differences in the
dataset's quality or inconsistencies in the selection and division of videos. We used the same
subset and split as Mokayed et al. to maintain consistency between our Convolutional Neural
Network (CNN) and their respective models.
A significant part of the data preparation involved the extraction of individual frames from the
chosen videos, each accompanied by a respective file featuring bounding box annotations, curated
through the CVAT tool as referenced in the Hama et al. paper. These annotations are further
transformed to accommodate YOLO model prerequisites. Our Convolutional Neural Network
(CNN) model was trained using heatmaps, which necessitated generating them from the
individual frames and their corresponding annotation files.
Table 1. Specification of Freya unmanned aircraft Mokayed et al.
Table 2. Specification of Freya unmanned aircraft Mokayed et al.
3.2 Data Splitting
In this research, we have gathered all the annotated videos from the NVD. These videos have a
resolution of 1920x1080 and depict various flight altitudes, snow covers, and cloud conditions.
We have split the dataset according to the same methodology used by Mokayed et al., as shown in
the Table 3. This enables us to train and benchmark our model on the same set of images.
The dataset has been split so that the training and testing sets have an equal distribution of snow
coverage and flight altitudes. This allows for a balanced performance evaluation of our model. The
train/test split is around 72:28. However, we have further split the training set by creating a
validation set. This set follows an 80:20 scale of training to validation. Therefore, the actual split
is 57:14:28 (train/validation/test). Videos are also never split between datasets; each video belongs
entirely to either the training or the testing set. This ensures that the model has never previously
seen cars from the testing dataset.
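The split arithmetic works out as follows (values truncated to whole percentages, which matches the 57:14:28 figure):

```python
train, test = 72, 28                  # initial train/test split (in % of frames)
val = int(train * 0.20)               # 20% of the training share -> validation
train_final = int(train * 0.80)       # remaining 80% stays in training
print(f"{train_final}:{val}:{test}")  # 57:14:28
```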
The training set is used to train the model, and the model sees the labels during this process. The
validation set is used after each training epoch. The model does not see its labels directly but
indirectly, as we use it for benchmarking between model iterations during training, and model
hyperparameters are adjusted accordingly. It also helps us avoid overfitting the model. Lastly, the testing
set is used for benchmarking between different models (such as YOLO, Faster R-CNN, and CNN)
and is used after training. The model does not see any labels from the testing set, even indirectly.
Overall, the proposed approach ensures a fair and unbiased evaluation of this research model's
performance.
Table 3. Summary of NVD dataset processing
3.3 Frame and Annotation Processing
The process of frame selection and annotation is pivotal to the detection model's functionality, as
its precision and performance are tied to the quality and appropriateness of its training data
(Liu et al., 2016; Ma et al., 2022).
For model training, it is essential to extract individual annotated images from the annotated videos,
as the network cannot ingest video directly. An annotation file, typically in .xml
format, provides a detailed record of each vehicle across all frames in which it appears. Each
annotation indicates the vehicle's location through the top-left and bottom-right coordinates of its
bounding box, accompanied by its rotation (Russakovsky et al., 2015).
The first step involves segmenting the video into individual frames. Subsequently, an individual
annotation file is generated for every frame derived from the video. Prior to this segmentation,
there was a need to convert the bounding box annotation to be compatible with YOLO and
Detectron-styled format.
For YOLO models, the annotation format is modified to the center coordinate of the bounding
box along with its width and height. Importantly, these coordinates are not absolute values
but are given in proportion to the size of the image. For instance, within an image measuring
1000x1000 pixels, a coordinate that originally reads as [200,200] would be represented as [0.2,0.2] (Redmon
et al., 2016).
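This conversion from absolute corner coordinates to normalized YOLO format can be sketched as follows (an illustrative helper, not the thesis's conversion script):

```python
def to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert absolute corner coordinates (x1, y1, x2, y2) to normalized
    YOLO format: (x_center, y_center, width, height), relative to image size."""
    return ((x1 + x2) / 2 / img_w,   # box center, as a fraction of width
            (y1 + y2) / 2 / img_h,   # box center, as a fraction of height
            (x2 - x1) / img_w,       # box width, normalized
            (y2 - y1) / img_h)       # box height, normalized

# A 100x100 box with its top-left corner at (200, 200) in a 1000x1000 image:
print(to_yolo(200, 200, 300, 300, 1000, 1000))  # (0.25, 0.25, 0.1, 0.1)
```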
On the other hand, the Faster R-CNN model requires absolute coordinates, which we transform
into top-left and bottom-right corners. This may seem identical to the original annotation, but the
original annotation included a rotation parameter and therefore captured the car more precisely.
Consequently, the bounding boxes tend to be larger once the rotation parameter is discarded. Although
this implies a less precise encapsulation of the object, the models mentioned require such a format
(Detectron2.data. transforms — Detectron2 0.6 Documentation, n.d.).
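Dropping the rotation amounts to computing the axis-aligned box that fully encloses the rotated rectangle. A minimal sketch, assuming the annotation provides center, size, and rotation angle (the exact annotation fields may differ):

```python
import math

def enclosing_box(cx, cy, w, h, angle_deg):
    """Axis-aligned box (x1, y1, x2, y2) that fully encloses a rotated
    rectangle given by center, size, and rotation. Discarding the rotation
    this way yields a larger, less precise box, as noted in the text."""
    a = math.radians(angle_deg)
    half_w = (abs(w * math.cos(a)) + abs(h * math.sin(a))) / 2
    half_h = (abs(w * math.sin(a)) + abs(h * math.cos(a))) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# A 40x20 car rotated 90 degrees encloses as a 20x40 axis-aligned box:
print(enclosing_box(100, 100, 40, 20, 90))  # ~(90.0, 80.0, 110.0, 120.0)
```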
Figure 10. Difference between inner bounding boxes with rotation and outer bounding boxes without rotation
When it comes to our CNN model, we have diverged from the traditional approach of using
bounding boxes as labels; instead, we propose the utilization of heatmaps. The heatmaps serve as
single-channel images with increased intensity in areas where the object (vehicle) is present and
lower or 0 intensity in areas where the object is not present (Huang et al., 2022). The generation
of these heatmaps is done through the application of a Gaussian elliptical function. The Gaussian
function has been selected to facilitate a gradual increase in pixel intensity (a proxy for the
likelihood of the presence of a vehicle) as one moves toward the central region of the car. This
feature ensures smoothness for the gradient descent during the training process (Huang et al., 2022).
Considering the rectangular shape of cars, we opted for an elliptical function, which entails setting
one dimension of the Gaussian function with a higher sigma value compared to the other. This
approach allows us to better represent the vehicles' shape in the analysis. Furthermore, we rotated
this elliptical Gaussian function using values derived from the original annotations to ensure an
optimal fit to the vehicle's orientation.
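A rotated elliptical Gaussian of this kind can be generated with numpy as follows. This is an illustrative sketch with hypothetical sigma values, not the thesis's exact code:

```python
import numpy as np

def elliptical_gaussian_heatmap(shape, cx, cy, sigma_x, sigma_y, angle_deg):
    """Single-channel heatmap with a rotated elliptical Gaussian centered on a
    vehicle: intensity peaks at the car's center and falls off smoothly, with
    a larger sigma along the car's long axis."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    a = np.deg2rad(angle_deg)
    # rotate pixel coordinates into the vehicle's frame of reference
    xr = (x - cx) * np.cos(a) + (y - cy) * np.sin(a)
    yr = -(x - cx) * np.sin(a) + (y - cy) * np.cos(a)
    g = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    return (g / g.max()).astype(np.float32)  # normalize to [0, 1]

hm = elliptical_gaussian_heatmap((64, 64), cx=32, cy=32,
                                 sigma_x=8, sigma_y=4, angle_deg=30)
print(hm.shape, hm.max())  # (64, 64) 1.0
```

Per-vehicle heatmaps produced this way can then be combined and renormalized for images containing multiple vehicles.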
Figure 11. Illustration of gaussian elliptical function over heatmap (Probability - How to Get a "Gaussian
Ellipse",2020)
To accommodate images featuring multiple vehicles, we devised a method in which the individual
Gaussian elliptical functions representing each vehicle are normalized to a scale ranging from 0 to
1 and then summed. However, we identified a challenge with this technique: when vehicles in the
image are positioned close to each other, the Gaussian functions overlap significantly, leading to
higher peaks in the heatmap that can overshadow vehicles positioned farther apart. To address this,
we refined our method by reducing the radius of the functions through lower sigma values, which
limits the extent to which individual Gaussian functions influence each other. This modification
confines the problem to just a few samples in the whole dataset.
The resulting heatmap is normalized again to the [0, 1] range with the precision of the float32 data
type. Increasing the precision to float64 would smooth the curve further, but it is not feasible for
training due to the marginal benefit relative to the added training cost and our model structure.
To retain the precision of the coordinates, we store the heatmaps as numpy arrays
rather than as .png images. While we do maintain a repository of heatmap representations in .png
format, these are for illustrative purposes only and are not utilized in the
training process. As a result of this process, we also have a separate annotation file with the
required formats and a heatmap for each image in the dataset.
Figure 12. From left to right: original image, produced heatmaps, overlay of both
3.4 Training Techniques, Models
The training techniques differ for every model type, and thus, the process of each training is
explained separately.
For retraining the YOLO models, the official Ultralytics library was used (Ultralytics YOLOv8 Docs,
n.d.), where the images were fed into the network with their corresponding YOLO-formatted
bounding box annotations. The "s" size models were chosen as they are the largest models that the
edge devices upon which this research is based could handle. Before initiating YOLO model
training, data augmentation on the images was applied, a strategy implemented to expand the
applicability and improve the precision of the model. The augmentation process occurred during
online training, leveraging both the YOLO’s inherent augmentation feature and the
Albumentations library.
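An augmentation such as a horizontal flip must transform the bounding boxes along with the pixels. A minimal numpy sketch of the idea (not the Albumentations implementation), using YOLO-style normalized boxes:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its YOLO-style boxes
    (x_center, y_center, w, h, all normalized to [0, 1])."""
    flipped = image[:, ::-1].copy()       # mirror pixels left-right
    # mirroring maps x_center to 1 - x_center; y, width, height are unchanged
    new_boxes = [(1.0 - xc, yc, w, h) for (xc, yc, w, h) in boxes]
    return flipped, new_boxes

img = np.arange(12).reshape(3, 4)
out, boxes = hflip_with_boxes(img, [(0.25, 0.5, 0.1, 0.2)])
print(boxes)  # [(0.75, 0.5, 0.1, 0.2)]
```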
Figure 13. Example of Yolo model sizes (Train Custom Data, 2023)
To facilitate the retraining of Detectron models, the well-known Detectron2 library was utilized,
enabling the integration of images and corresponding Detectron-formatted bounding box
annotations into the system (Sethi, n.d.). The official model weights of "faster_rcnn_R_50_FPN_3x"
from Detectron2 were used as a baseline for retraining on the NVD dataset. The model training and
hyperparameter tuning were undertaken in collaboration with the Mokayed NVD team to compare
the performance and add to their research with our work on edge devices. For the YOLO models,
image augmentation techniques were also applied according to Mokayed et al., 2023 and the
codebase specifications, using the Albumentations library. In terms of training our proprietary CNN model, there
was no pre-defined framework or documentation to guide the process. Primarily, the PyTorch
library was utilized to facilitate the development of a customized training methodology.
First, a CNN architecture capable of tackling this problem was devised:
Figure 14. CNN network architecture made as pytorch nn.Module.
The architecture of the network follows a so-called encoder-decoder framework. Initially,
convolutional layers functioning as encoders decrease spatial dimensions and facilitate feature
extraction. Subsequently, upsampling layers, also known as transpose layers, assume the role of
decoders, which increase the spatial dimensions and generate output.
The convolution operation is first applied, followed by ReLU activation on the input data. After
each convolution operation, a max pooling layer is performed to reduce the spatial dimensions of
the feature maps by a factor of 2. This cycle is repeated four times. The ReLU activation function
is utilized due to its ability to mitigate the vanishing gradient problem and accelerate training.
In the upsampling stage, a series of transpose convolutions is used. These differ slightly from a
classical upsampling layer, which performs a series of simple interpolations: a transpose
convolution also has trainable parameters, which can improve performance. These layers reduce
the channels and increase the spatial dimensions, ultimately generating our predicted heatmap.
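The encoder-decoder structure described above can be sketched as a PyTorch nn.Module. This is a minimal illustration of the pattern (four conv/ReLU/max-pool stages, then four transpose convolutions); the channel widths are hypothetical, not the thesis's exact values:

```python
import torch
import torch.nn as nn

class HeatmapNet(nn.Module):
    """Minimal encoder-decoder sketch: four conv/ReLU/max-pool stages halve
    the spatial size (encoder), then four transpose convolutions restore it
    while reducing channels down to a single-channel heatmap (decoder)."""
    def __init__(self):
        super().__init__()
        enc, chans = [], [3, 16, 32, 64, 128]
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(2)]            # halves H and W each stage
        self.encoder = nn.Sequential(*enc)
        dec, rev = [], [128, 64, 32, 16, 1]
        for c_in, c_out in zip(rev[:-1], rev[1:]):
            # trainable upsampling: doubles H and W each stage
            dec += [nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = HeatmapNet()
out = net(torch.zeros(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```

With input spatial size 64, the encoder reduces it to 4 and the decoder restores it to 64, matching the heatmap resolution to the input image.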
Figure 15. Our CNN network architecture