This thesis aims to evaluate the effectiveness of commonly used object detection methods like YOLO v8, YOLO v5, and Faster R-CNN in identifying vehicles in images captured by drones under varying degrees of snow cover. It also explores data augmentation techniques and edge devices' real-time tracking capabilities when applied to aerial images in snowy conditions. The goal is to contribute insights into challenges of vehicle detection in snow and suggest improvements to accuracy and reliability of detection systems in adverse weather.
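Evaluating detectors such as YOLO or Faster R-CNN against ground-truth boxes typically comes down to intersection-over-union (IoU) between predicted and labeled boxes. As an illustrative sketch (not code from the thesis), a minimal IoU computation looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap at all.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection is usually counted as correct when IoU with a ground-truth box exceeds a threshold such as 0.5.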
Detecting fraud with Python and machine learning (wgyn)
- Machine learning models are used to detect fraud by estimating the probability of fraud given transaction features.
- Building and updating fraud detection models involves significant work in feature engineering, model training, evaluation, and monitoring in production.
- Debugging a model that was performing poorly revealed an important predictive feature - whether a customer's email address was provided - that improved the model once incorporated.
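As a toy illustration of the idea in these bullets (the weights and feature set below are invented for the example, not taken from any real fraud model), a logistic scorer that includes a "was an email address provided" flag might look like:

```python
import math

def fraud_probability(amount, hour, email_provided, weights=None):
    """Toy logistic model: P(fraud | features). Weights are illustrative,
    not from any trained production model."""
    if weights is None:
        # bias, amount (per $1000), late-night flag, missing-email flag
        weights = {"bias": -3.0, "amount_k": 0.8, "night": 0.7, "no_email": 1.5}
    z = (weights["bias"]
         + weights["amount_k"] * (amount / 1000.0)
         + weights["night"] * (1 if hour < 6 else 0)
         + weights["no_email"] * (0 if email_provided else 1))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to (0, 1)
```

A missing email address raises the score, mirroring the debugging anecdote above.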
Computer vision has received great attention over the last two decades.
This research field is important not only in security-related software but also in the advanced interface between people and computers, advanced control methods, and many other areas.
Face recognition technology uses machine learning algorithms to identify or verify a person's identity from digital images or video frames. The process involves detecting faces, applying preprocessing techniques like filtering and scaling, training classifiers using labeled face images, and then classifying new faces. Common machine learning algorithms used include K-nearest neighbors, naive Bayes, decision trees, and locally weighted learning. The proposed system detects faces, builds a tabular dataset from pixel values, trains classifiers, and evaluates performance on a test set. Software applies techniques like detection, alignment, normalization, and matching to encode faces for comparison. Face recognition has advantages like convenience and low cost, and applications in security, banking, and more.
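A minimal sketch of the "tabular dataset from pixel values plus K-nearest neighbors" step described above, using made-up toy rows rather than real face images:

```python
def knn_predict(train_rows, train_labels, query, k=3):
    """Classify a flattened image (a list of pixel values) by majority
    vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

In a real system each row would be a preprocessed (filtered, scaled) face image flattened into one vector.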
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
The Gabor filter is a powerful way to enhance biometric images such as fingerprint images in order to extract correct features from them. It is also used to extract features directly, as in iris images, and is sometimes used for texture analysis. In fingerprint images, the even-symmetric Gabor filter, a contextual (multi-resolution) filter, is used to enhance the image in two ways: it fills small gaps in the direction of the ridge (a low-pass effect in the black regions), and it increases the discrimination between ridges and valleys (black and white regions) in the direction orthogonal to the ridge. The proposed method applies the Gabor filter to a fingerprint image that has first been translated into a binary image after some simple enhancement steps, in order to partially overcome the high computational cost of the Gabor filter.
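The even-symmetric Gabor kernel the passage refers to can be written out directly from its standard definition (a cosine carrier under a Gaussian envelope); the parameter values below are illustrative, not the ones used in the proposed method:

```python
import math

def even_gabor_kernel(size, theta, freq, sigma):
    """Even-symmetric Gabor kernel oriented along angle theta,
    as commonly used to enhance fingerprint ridges."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the ridge frame.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * freq * xr))
        kernel.append(row)
    return kernel
```

Convolving the image with this kernel at the local ridge orientation smooths along ridges and sharpens across them.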
The document describes a project that aims to develop a mobile application for real-time object and pose detection. The application will take in a real-time image as input and output bounding boxes identifying the objects in the image along with their class. The methodology involves preprocessing the image, then using the YOLO framework for object classification and localization. The goals are to achieve high accuracy detection that can be used for applications like vehicle counting and human activity recognition.
The document describes two feature extraction methods: attention based and statistics based. The attention based method models how human vision finds salient regions using an architecture that decomposes images into channels and creates image pyramids, then combines the information to generate saliency maps. This method was applied to face recognition but had problems with pose and expression changes. The statistics based method aims to select a subset of important features using criteria based on how well the features represent the original data.
Yinyin Liu presents a model for object detection and localization called Fast R-CNN. She shows how to introduce an ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin runs through a demo of applying the trained model to detect new objects.
Real Time Object Detection using machine learning (pratik pratyay)
This document discusses the development of a real-time object detection system using computer vision techniques. It aims to recognize and label moving objects in video streams from monitoring cameras with high accuracy and in a short amount of time. The system will use a hybrid model of convolutional neural networks and support vector machines for feature extraction and classification of objects from camera feeds into predefined classes. It is intended to help analyze surveillance video by only flagging clips that contain objects of interest like people or vehicles, reducing wasted storage and review time.
Online fraud costs the global economy more than $400 billion, with more than 800 million personal records stolen in 2013 alone. Increasingly, fraud has diversified to different digital channels, including mobile and online payments, creating new challenges as innovative fraud patterns emerge. Hence it is still a challenge to find effective methods to mitigate fraud. Existing solutions include simple if-then rules and classical machine learning algorithms.
From an academic perspective, credit card fraud detection is a standard classification problem in which historical transaction data is used to predict future fraud. However, practical aspects make the problem more complex. Existing comparison measures lack a realistic representation of monetary gains and losses, which is necessary for effective fraud detection. Moreover, of the enormous volume of transactions only a tiny fraction are fraudulent, which implies a huge class imbalance. Additionally, a real fraud detection system is required to respond in milliseconds, a criterion that must be taken into account during modeling for the system to be successfully deployed. To address these problems, this presentation compares two recently proposed algorithms, Bayes minimum risk and an example-dependent cost-sensitive decision tree, against state-of-the-art algorithms; they show significant improvements as measured by financial savings.
The document discusses object tracking in computer vision. It begins with an introduction and overview of applications of object tracking. It then discusses object representation, detection, tracking algorithms and methodologies. It compares different tracking methods and provides an example of object tracking in MATLAB. Key steps in object tracking include object detection, tracking the detected objects across frames using algorithms like point tracking, kernel tracking and silhouette tracking. Common challenges with object tracking are also summarized.
This document presents a fast algorithm for license plate detection. It begins with an introduction that outlines the need for automatic license plate recognition systems. It then discusses previous work in the area and the challenges involved. The proposed technique is divided into four main parts: histogram equalization, removal of border and background, image segmentation, and license plate detection using feature extraction, principal component analysis, and artificial neural networks. Test results on a dataset of 30 images achieved a 93.33% detection rate. Future work involves implementing the neural network classifier on an FPGA for increased speed.
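The first stage of that pipeline, histogram equalization, can be sketched on a flat list of gray levels (a simplified stand-in for a 2-D plate image):

```python
def equalize_histogram(pixels, levels=256):
    """Remap gray levels so the cumulative distribution is roughly uniform,
    spreading out contrast before segmentation."""
    n = len(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Build the cumulative distribution function.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
            for p in pixels]
```

In the paper's setting this would run on the full image before border removal and segmentation.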
Deep learning and neural networks are inspired by biological neurons. Artificial neural networks (ANN) can have multiple layers and learn through backpropagation. Deep neural networks with multiple hidden layers did not work well until recent developments in unsupervised pre-training of layers. Experiments on MNIST digit recognition and NORB object recognition datasets showed deep belief networks and deep Boltzmann machines outperform other models. Deep learning is now widely used for applications like computer vision, natural language processing, and information retrieval.
We use 7 emotions, namely 'Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad', and 'Surprise', to train and test our algorithm using Convolutional Neural Networks.
This document presents a proposed methodology for offline signature recognition using global and grid features extracted from signature images. The methodology involves preprocessing signatures, extracting global and grid features using discrete wavelet transforms, training a backpropagation neural network on the features, and classifying signatures based on the trained network. Experimental results show classification accuracy rates ranging from 89-93% for signatures from 10 to 50 individuals. Future work could involve exploring different signature features to potentially improve recognition performance.
Outlier analysis is used to identify outliers, which are data objects that are inconsistent with the general behavior or model of the data. There are two main types of outlier detection - statistical distribution-based detection, which identifies outliers based on how far they are from the average statistical distribution, and distance-based detection, which finds outliers based on how far they are from other data objects. Outlier analysis is useful for tasks like fraud detection, where outliers may indicate fraudulent activity that is different from normal patterns in the data.
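The distance-based definition can be sketched in a few lines; here the points are one-dimensional and the eps and fraction thresholds are arbitrary example values:

```python
def distance_based_outliers(points, eps, min_frac):
    """Flag points that have fewer than min_frac of the other points
    within distance eps (the classic distance-based outlier definition)."""
    outliers = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and abs(q - p) <= eps)
        if neighbors / (len(points) - 1) < min_frac:
            outliers.append(p)
    return outliers
```

In a fraud setting, a transaction far from all typical transactions under this criterion would be flagged for review.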
Automatic Attendance System using Facial Recognition (Nikyaa7)
It is a biometric-based app that is gradually evolving into a universal biometric solution, with virtually zero effort required from the user when compared with other biometric options.
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selection based on Mutual Information (Abeer Alzubaidi, Georgina Cosma, David Brown and Graham Pockley)
Interactive Technologies and Games (ITAG) Conference 2016
Health, Disability and Education. Dates: Wednesday 26 October 2016 - Thursday 27 October 2016. Location: The Council House, NG1 2DT.
A comprehensive tutorial on Convolutional Neural Networks (CNN) which talks about the motivation behind CNNs and Deep Learning in general, followed by a description of the various components involved in a typical CNN layer. It explains the theory involved with the different variants used in practice and also, gives a big picture of the whole network by putting everything together.
Next, there's a discussion of the various state-of-the-art frameworks being used to implement CNNs to tackle real-world classification and regression problems.
Finally, the implementation of CNNs is demonstrated by implementing the paper 'Age and Gender Classification Using Convolutional Neural Networks' by Levi and Hassner (2015).
- Image classification involves training a classifier on labeled images, validating hyperparameters, and testing on unlabeled images.
- Nearest neighbor classification predicts labels of nearest training examples while linear classification learns weights to separate classes with a hyperplane.
- Loss functions like cross-entropy measure how well the classifier's predicted scores match the true labels and are minimized during training.
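The cross-entropy loss mentioned in the last bullet can be computed from raw class scores as follows (a minimal sketch, using the usual max-shift for numerical stability):

```python
import math

def softmax_cross_entropy(scores, true_index):
    """Cross-entropy loss of raw class scores against the true label."""
    m = max(scores)                              # shift for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax probabilities
    return -math.log(probs[true_index])
```

Training minimizes this value, pushing the score of the correct class above the others.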
Credit Card Fraud Detection Using Unsupervised Machine Learning Algorithms (Hariteja Bodepudi)
This document summarizes a research paper that uses unsupervised machine learning algorithms to detect credit card fraud. It describes how credit card fraud has increased with the rise of online shopping and payments. Unsupervised algorithms are well-suited for this task since labeled fraud data can be difficult to obtain. The paper tests Isolation Forest, Local Outlier Factor, and One Class SVM on a credit card transaction dataset to find anomalies (fraudulent transactions). Isolation Forest achieved the highest accuracy at 99.74%, slightly outperforming Local Outlier Factor, while One Class SVM had much lower accuracy. The paper concludes unsupervised algorithms are effective for anomaly detection tasks like credit card fraud detection.
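The core intuition behind Isolation Forest, that anomalies are separated by random splits in far fewer steps than normal points, can be sketched on one-dimensional data (a simplified illustration, not the implementation the paper evaluated):

```python
import random

def isolation_path_length(point, data, rng, max_depth=10):
    """Number of random splits needed to isolate `point` from `data`."""
    depth = 0
    while depth < max_depth and len(data) > 1:
        lo, hi = min(data), max(data)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Keep only the values on the same side of the split as `point`.
        data = [v for v in data if (v < split) == (point < split)]
        depth += 1
    return depth

def anomaly_score(point, data, n_trees=200, seed=0):
    """Average path length over many random trees; shorter = more anomalous."""
    rng = random.Random(seed)
    paths = [isolation_path_length(point, data, rng) for _ in range(n_trees)]
    return sum(paths) / len(paths)
```

A fraudulent transaction, lying far from the dense cluster of normal ones, tends to be isolated after very few splits.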
Introduction to image processing and pattern recognition (Saibee Alam)
This PowerPoint presentation provides a brief introduction to image processing and pattern recognition and its related research papers, including a conclusion.
The document discusses object recognition in computer vision. It begins with an overview of object recognition, describing it as the task of finding and identifying objects in images. It then discusses several specific applications of object recognition, including fingerprint recognition and license plate recognition. Fingerprint recognition involves extracting features called minutiae from fingerprint images, which are ridge endings and bifurcations. License plate recognition uses an ALPR system to segment character images, normalize them, and recognize the characters.
The document discusses perception in artificial intelligence. It defines perception as acquiring, interpreting, and organizing sensory information. Perception involves both sensation, where sensors convert signals into data, and higher-level processes that make sense of the data. The document then discusses challenges in perception like abstraction and uncertainty in relations. It also notes perception is influenced by both internal and external factors beyond just signals.
Person Detection in Maritime Search And Rescue Operations (IRJET Journal)
This document discusses recent research on using computer vision and machine learning techniques for person detection in maritime search and rescue operations from images and video captured by drones. Specifically, it summarizes 12 research papers on this topic, covering approaches such as training convolutional neural networks on bird's eye view datasets to detect people from aerial images, using multiple detection methods like sliding windows and precise localization, combining data from multiple drones and sensors to optimize search efforts, and evaluating models on both RGB and thermal image datasets. The goal of this research is to automate part of the search process to make maritime rescue operations more efficient and effective.
Person Detection in Maritime Search And Rescue Operations (IRJET Journal)
1) The document discusses using machine learning and computer vision techniques for person detection in maritime search and rescue operations using drones/UAVs. It aims to automatically detect people in images/videos captured by drones to help with search efforts.
2) A key challenge is that people appear small in drone footage and are often obscured by vegetation or terrain. The models need to be trained on similar bird's eye view data to achieve high accuracy. The document reviews different person detection models and their use in search and rescue.
3) It discusses recent work involving using efficient neural networks like MobileNet for object detection from drones. Other work involves using depth sensors and pose estimation for person tracking, as well as using distributed deep learning
Object Detection for Autonomous Cars using AI/ML (IRJET Journal)
The document discusses using machine learning and computer vision techniques for object detection in autonomous vehicles. Specifically, it proposes using the Single Shot Detector (SSD) algorithm to identify and classify objects around a self-driving car from camera images. The SSD model was trained on a dataset to detect common objects like cars, people, buses etc. and estimate bounding boxes around detected objects. The methodology uses OpenCV and TensorFlow to implement SSD on images from a webcam in real-time. While bounding boxes were sometimes inconsistent in dense traffic, detection was more accurate for objects closer to the camera or in less crowded scenarios. The goal is to demonstrate how computer vision allows autonomous vehicles to perceive their surroundings.
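Detectors like SSD emit many overlapping candidate boxes, which is one reason bounding boxes can look inconsistent in dense traffic; a non-maximum suppression pass is the standard post-processing step. A minimal sketch (not the paper's exact implementation):

```python
def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes and drop lower-scoring boxes that
    overlap an already-kept box by more than iou_thresh."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices of surviving boxes
```

Tightening or loosening iou_thresh trades duplicate detections against missed nearby objects.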
This document introduces the Named Data Networking (NDN) project, which proposes a new internet architecture called NDN. The key points are:
1. NDN is based on named data rather than endpoints, allowing data to be directly addressed and routed instead of requiring encapsulation in endpoint-to-endpoint communication.
2. The NDN architecture uses hierarchical names for data, in-network caching, interest-based forwarding using a pending interest table, and data-centric security with signatures.
3. The document outlines several research areas needed to develop and evaluate the NDN architecture, including routing scalability, fast name lookup, caching policies, security mechanisms, and applications to drive adoption.
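The pending interest table mentioned in point 2 can be illustrated with a toy data structure (the names and face identifiers here are invented for the example):

```python
class PendingInterestTable:
    """Toy NDN forwarding state: map a data name to the set of faces that
    requested it, so one returning Data packet satisfies all requesters."""
    def __init__(self):
        self.table = {}

    def add_interest(self, name, face):
        already_pending = name in self.table
        self.table.setdefault(name, set()).add(face)
        # If the name was already pending, the Interest was aggregated
        # and need not be forwarded upstream again.
        return already_pending

    def satisfy(self, name):
        """Data arrived: return the faces to send it on, clearing the entry."""
        return self.table.pop(name, set())
```

This aggregation of duplicate Interests is what lets one cached or returning Data packet serve many consumers.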
The document summarizes two projects completed during an internship at JCDecaux:
1. Psychographic Spatial Segmentation (PSA) aimed to cluster geographic regions based on residents' preferences from an external data provider, but the project was interrupted when data access stopped.
2. Ads Recognition sought to classify ad image content to restrict inappropriate ads, comparing free models to Amazon Rekognition. Various datasets and models were evaluated to map image tags to business needs.
As Zambia's population grows and residential and industrial developments expand, the demand for efficient and effective transit services is increasing in developing areas including (Lusaka, Kitwe, Livingstone etc.)
SMARTbus system will provide customers with a variety of features including automated voice stop announcements, automated exterior route and destination announcements, automated passenger counter, and GPS location services. Meaning, SMARTbus riders can track their bus in real-time. No more standing at a bus stop wondering if a bus already came or is stuck somewhere in traffic. Smart bus technology delivers up-to-the-minute bus departure information for every bus stop in the city/route.
Q. Why are new smart bus transportation systems such as SMARTbus needed?
These systems are expected to enable bus operators to improve vehicle safety, and schedule reliability for the fixed-route services.
Additionally, providing enhanced service and service information to riders as they travel.
Automatically announce the next scheduled stop and display this information on a sign inside the bus in real time.
Automatically announce the bus destination and time estimate to waiting passengers at each scheduled bus stop.
Advanced High Speed Black Box Based Vehicle Crash Investigation Systemijtsrd
This document describes a proposed vehicle black box system that records data during crashes to aid in accident investigations. It would include sensors to monitor speed, location via GPS, and video footage via front and back cameras. The data would be stored temporarily in memory and could then be retrieved for analysis after a crash. The system aims to objectively determine the causes and sequence of events leading up to and during accidents. This could help insurance claims and reduce pending cases by providing clear evidence.
The document presents a thesis that developed and evaluated a localization component for mobile service applications. The thesis implemented a platform called YourWay! that collects contextual data from distributed sources and facilitates instant location information to mobile users. Empirical evaluation of YourWay! assessed user experience in indoor and outdoor environments. Results showed user experience was more reliable within community WiFi infrastructure, especially indoors, depending on access point coverage, density, and structure.
This document discusses improving the scalability, availability, and reliability (SAR) of virtualization for operating systems, networks, and storage. It examines Oracle Solaris virtualization technology (Solaris Containers) which allows multiple isolated operating systems to run simultaneously on the same hardware. The author aims to build fault tolerance into the Solaris operating system and its virtual environments (Solaris Containers) through designs that optimize network connectivity using IPMP and aggregation, storage using Solaris Volume Manager (SVM) and ZFS, and shared resource allocation. The thesis will evaluate how these designs increase SAR when outages occur at the operating system, network, or storage levels for both planned and unplanned maintenance scenarios.
This thesis presents a framework for integrated uncertainty modeling and visualization. It aims to address four major barriers: (1) users must anticipate their uncertainty needs before building models, (2) uncertainty parameters are treated the same as variables, (3) uncertainty propagation must be manually managed, and (4) visualization techniques are largely incompatible with different uncertainty types. The framework encapsulates uncertainty into atomic variables, automates uncertainty propagation, and abstracts visual mappings from the underlying uncertainty type. It extends the traditional spreadsheet to intrinsically support uncertainty modeling and visualization. Case studies demonstrate the framework for business planning, financial decision support, and process specifications.
Automatic Detection of Unexpected Accidents Monitoring Conditions in TunnelsIRJET Journal
The document describes a proposed system to automatically detect accidents and unexpected events in road tunnels using video footage from CCTV cameras. The system would use object detection and tracking technology, along with a Faster R-CNN deep learning model, to identify objects like vehicles, fires, and people in tunnel videos. It would monitor the movement and position of detected objects over time to identify accidents or other irregular events. If an accident is detected, a signal would be sent to alert authorities so they can respond quickly. The system aims to address the challenges of limited visibility and low-quality images from tunnel CCTV cameras.
Traffic Congestion Detection Using Deep Learningijtsrd
Despite the huge amount of traffic surveillance videos and images have been accumulated in the daily monitoring, deep learning approaches have been underutilized in the application of traffic intelligent management and control. Traffic images, including various illumination, weather conditions, and vast scenarios are considered and preprocessed to set up a proper training dataset. In order to detect traffic congestion, a network structure is proposed based on residual learning to be pre trained and fine tuned. The network is then transferred to the traffic application and retrained with self established training dataset to generate the Traffic Net. The accuracy of Traffic Net to classify congested and uncongested road states reaches 99 for the validation dataset and 95 for the testing dataset. The trained model is stored in cloud storage for easy access for application from anywhere. The proposed Traffic Net can be used by a regional detection of traffic congestion on a large scale surveillance system. The effectiveness and efficiencies are magnificently demonstrated with quick detection in the high accuracy in the case study. The experimental trial could extend its successful application to traffic surveillance system and has potential enhancement for intelligent transport system in future. Anusha C | Dr. J. Bhuvana "Traffic Congestion Detection Using Deep Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-2 , February 2022, URL: https://www.ijtsrd.com/papers/ijtsrd49401.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/49401/traffic-congestion-detection-using-deep-learning/anusha-c
Accident vehicle types classification: a comparative study between different...nooriasukmaningtyas
Classifying and finding type of individual vehicles within an accident image are considered difficult problems. This research concentrates on accurately classifying and recognizing vehicle accidents in question. The aim to provide a comparative analysis of vehicle accidents. A number of network topologies are tested to arrive at convincing results and a variety of matrices are used in the evaluation process to identify the best networks. The best two networks are used with faster recurrent convolution neural network (Faster RCNN) and you only look once (YOLO) to determine which network will identifiably detect the location and type of the vehicle. In addition, two datasets are used in this research. In consequence, experiment results show that MobileNetV2 and ResNet50 have accomplished higher accuracy compared to the rest of the models, with 89.11% and 88.45% for the GAI dataset as well as 88.72% and 89.69% for KAI dataset, respectively. The findings reveal that the ResNet50 base network for YOLO achieved higher accuracy than MobileNetV2 for YOLO, ResNet50 for Faster RCNN with 83%, 81%, and 79% for GAI dataset and 79%, 78% and 74% for KAI dataset.
Android Project report on City Tourist Location based services (Shuja ul hassan)Shuja Hassan
The aim to design and develop this project is to produce a
tourist guide for Skardu city, which can eefficiently guides the
tourist who visits Skardu. The Android tourist guide can be use in place of professional guide due to many reasons like reduce cost of guide, get more accurate information needed for decision making, giving weather and social networking services.The tourists can use this guide for different purposes like searching a location , calculate distance between two locations,getting basic textual information, pictorial information of location which normally we could not get in default Google maps.The guide uses Google Map API, global
positioning system( GPS), Internet and cellular data to provide
its services.
Shuja ul Hassan
IT Teacher
Android Developer
shuja2good@gmail.com
The document describes the development of a traffic sign recognition system using machine learning techniques. It involves building a convolutional neural network (CNN) model to classify images of traffic signs into different categories. The front-end will utilize libraries like Pandas, NumPy, Matplotlib and OpenCV for data processing and visualization. Tkinter will be used for the graphical user interface. The back-end will use TensorFlow and Keras deep learning frameworks to develop the CNN model for traffic sign classification. The system aims to accurately detect and recognize traffic signs to help with autonomous driving.
Innovative Payloads for Small Unmanned Aerial System-Based PersonAustin Jensen
This document describes Austin Jensen's doctoral dissertation titled "Innovative Payloads for Small Unmanned Aerial System-Based Personal Remote Sensing and Applications". The dissertation focuses on using small unmanned aerial systems (UAS) as remote sensing platforms to collect aerial imagery and track radio-tagged fish. It presents novel methods for calibrating imagery from commercial cameras on UAS and estimating the location of radio-tagged fish using multiple UAS. The calibration methods allow imagery to be orthorectified and used to deliver actionable information to end users. Simulations evaluate the fish tracking methods and predict their real-world performance. The dissertation contributes to advancing UAS-based remote sensing for applications such as precision agriculture, vegetation mapping,
This document provides guidelines for integrating sensor data into a Grid-enabled Spatial Data Infrastructure (GSDI). It describes current sensor technologies, including wireless sensor networks and the Sensor Web. It analyzes different solutions for integrating sensor measurements into a GSDI to support the EnviroGRIDS project, which aims to build capacity for environmental observation in the Black Sea region. The document recommends adopting the Sensor Observation Service to enable interoperable access to sensor data within the GSDI and describes a proposed system using wireless sensor networks, mobile communication units, and various client applications to integrate real-time sensor data.
This document provides an introduction to spatial data analysis using open source software R. It discusses spatial data concepts like spatial reference systems and coordinate reference systems. It describes how to create, load and visualize spatial point, line and polygon data in R. It also covers digital image processing and classification in QGIS. Methods discussed include spatial point pattern analysis, interpolation, geostatistics, spatial modeling and accuracy assessment. The document uses data from Kilimanjaro region as an example to demonstrate these spatial analysis techniques in R and QGIS.
An Audit of Augmented Reality Technology UtilizationIRJET Journal
This document summarizes research on the applications and uses of augmented reality technology. It discusses how augmented reality has been used in various fields such as engineering, tourism, defense, education, public safety, gaming, and medicine. Some key applications mentioned include using AR for construction projects, enhancing the tourist experience, military training, improving education, and assisting medical procedures. The document also reviews challenges with AR such as registration and detection errors and discusses potential future research directions.
Similar to Real Time Vehicle Detection for Intelligent Transportation Systems (20)
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Real Time Vehicle Detection for Intelligent Transportation Systems
DEGREE PROJECT
Real Time Vehicle Detection for Intelligent
Transportation Systems
Christián Ulehla
Elda Shurdhaj
Master Programme in Data Science
2023
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
ABSTRACT
This thesis aims to analyze how object detectors perform under winter weather conditions,
specifically in areas with varying degrees of snow cover. The investigation will evaluate the
effectiveness of commonly used object detection methods in identifying vehicles in snowy
environments, including YOLO v8, YOLO v5, and Faster R-CNN. Additionally, the study explores
the process of labeling vehicle objects within a set of image frames to produce high-quality
annotations in terms of correctness, detail, and consistency. Training data is the cornerstone upon
which the development of machine learning is built. Inaccurate or inconsistent annotations can
mislead the model, causing it to learn incorrect patterns and features. Data augmentation
techniques like rotation, scaling, or color alteration have been applied to enhance some models'
robustness to recognize objects under different alterations. The study aims to contribute to the field
of deep learning by providing valuable insights into the challenges of detecting vehicles in snowy
conditions and offering suggestions for improving the accuracy and reliability of object detection
systems. Furthermore, the investigation will examine edge devices' real-time tracking and
detection capabilities when applied to aerial images under these weather conditions. What drives
this research is the need to delve deeper into the research gap concerning vehicle detection using
drones, especially in adverse weather conditions. It highlights the scarcity of substantial datasets
before Mokayed et al. published the Nordic Vehicle Dataset. Using unmanned aerial vehicles
(UAVs) or drones to capture real images in different settings and under various snow cover
conditions in the Nordic region contributes to expanding the existing dataset, which has previously
been restricted to non-snowy weather conditions. In recent years, the use of drones to capture
real-time data for optimizing intelligent transport systems has surged. Drones offer an aerial
perspective for efficiently collecting data over large areas to monitor vehicular movement
precisely and in a timely manner, an area that is imperative to address. Snowy weather
conditions, moreover, can create an environment of limited visibility, significantly complicating
data interpretation and object detection. The emphasis is placed on edge devices' real-time
tracking and detection capabilities, which in this study means integrating edge computing into
drone technologies to explore the speed and efficiency of data processing in such systems.
Key Concepts: Unmanned aerial vehicles (UAVs), vehicle detection, CNN architecture, snow
cover, NVD, YOLOv5s, YOLOv8s, Faster R-CNN, real-time processing, edge devices.
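The augmentation techniques named in the abstract (rotation, scaling, color alteration) can be sketched as a simple transform over an image array. The function below is an illustrative stand-in, not the pipeline used in this thesis: rotation is simplified to 90-degree steps, scaling to a random crop, and all parameter ranges are assumptions.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Simplified augmentation for an HxWx3 uint8 aerial frame.

    Illustrative only: rotation is restricted to 90-degree steps,
    scaling is approximated by a random crop, and the ranges below
    are assumptions, not the thesis's actual settings.
    """
    # rotation: random multiple of 90 degrees
    img = np.rot90(img, k=int(rng.integers(0, 4)))
    # scaling: random crop covering 80-100% of each side
    h, w = img.shape[:2]
    s = rng.uniform(0.8, 1.0)
    ch, cw = int(h * s), int(w * s)
    y0 = int(rng.integers(0, h - ch + 1))
    x0 = int(rng.integers(0, w - cw + 1))
    img = img[y0:y0 + ch, x0:x0 + cw]
    # color alteration: random brightness gain
    gain = rng.uniform(0.7, 1.3)
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```

In a training loop such a transform would typically be applied per frame, with the corresponding bounding-box annotations adjusted to match any geometric change.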
Dedication and Acknowledgments
First and foremost, we convey our profound appreciation to our advisor, Hamam Mokayed, whose
expertise, understanding, and patience, added considerably to our graduate experience. His
mentorship, encouragement, and candid criticism have been invaluable.
We would also like to express our heartfelt thanks to the NVD team, especially Amirhossein
Nayebiastaneh, for their unwavering support throughout the thesis. Their provision of resources
and expertise significantly enriched our research.
In the spirit of true collaboration, we extend our deepest gratitude to each other as co-authors of
this thesis. This journey has been a testament to our shared dedication, mutual respect, and
collective pursuit of excellence. Sharing the challenges and hard work made this journey all the
more rewarding.
To our incredible families and loved ones,
This thesis would not have been possible without your unwavering support, occasional pep talks,
and endless supply of coffee (and cookies!). You've been the cheerleaders in the corner and the
providers of a continuous flow of care packages that truly kept us going.
Thank you all!
Elda and Christián
CONTENTS
CHAPTER 1 ................................................................................................................................ 1
1. Introduction ..................................................................................................................... 1
1.1 Background................................................................................................................... 1
1.2 Motivation ..................................................................................................................... 3
1.3 Problem Definition......................................................................................................... 5
1.4 Goals and Delimitations ................................................................................................ 6
1.5 Research Methodology ................................................................................................. 7
1.6 Structure of the Thesis .................................................................................................. 9
CHAPTER 2 ...............................................................................................................................12
2. Contextual Framework ...................................................................................................12
2.1 Computer Vision and Vehicle Detection .......................................................................12
2.2 Deep Learning Techniques for Vehicle Detection.........................................................16
2.3 Aerial Image Detection Challenges ..............................................................................26
2.4 Real-Time Tracking and Detection on Edge Devices ...................................................28
2.5 Evaluation Metrics and Benchmarking .........................................................................29
2.6 Relevant Studies, Research Gaps and Future Directions.............................................36
CHAPTER 3 ...............................................................................................................................39
3. Thesis Methodology Approach .......................................................................................39
3.1 Dataset Acquisition and Description.............................................................................39
3.2 Data Splitting................................................................................................................41
3.3 Frame and Annotation Processing ...............................................................................42
3.4 Training Techniques, Models .......................................................................................45
3.5 Edge Device Application ..............................................................................................48
3.6 Evaluation Metrics and Benchmarking .........................................................................50
3.7 Ethical Considerations .................................................................................................54
CHAPTER 4 ...............................................................................................................................56
4. Analysis, Experiments and Results.................................................................................56
4.1 Data Analysis and Experiments....................................................................................56
4.2 Results.........................................................................................................................63
CHAPTER 5 ...............................................................................................................................64
5. Discussion, Final Remarks, and Prospects for Future Research ....................................64
5.1 Discussion ...................................................................................................................64
5.2 Conclusions .................................................................................................................66
5.3 Future Work .................................................................................................................67
References ..............................................................................................................................69
List of Abbreviations
CNN Convolutional Neural Networks
CV Computer Vision
CVAT Computer Vision Annotation Tool
DOTA Dataset for Object Detection in Aerial Images
NVD Nordic Vehicle Dataset
OBB Oriented Bounding Boxes
R-CNN Region-based Convolutional Neural Networks
ReLU Rectified Linear Unit
UAV Unmanned Aerial Vehicles
YOLO You Only Look Once
List of Tables and Figures
Table 1. Specification of Freya unmanned aircraft Mokayed et al. ............................................40
Table 2. Specification of Freya unmanned aircraft Mokayed et al. ............................................41
Table 3. Summary of NVD dataset processing..........................................................................42
Table 4. Specifications of PC on which model was trained on...................................................48
Table 5. Results of selected detection models. .........................................................................63
Figure 1. Machine Learning Family ...........................................................................................16
Figure 2. YOLO evolution..........................................................................................................20
Figure 3. Faster R-CNN architecture.........................................................................................21
Figure 4. Illustration of precision-recall curve ............................................................................31
Figure 5. Intersection over Union ..............................................................................................33
Figure 6. mAP flowchart calculation ..........................................................................................35
Figure 7. Steps for calculating mAP ......................................................................................35
Figure 8. Confusion Matrix ........................................................................................................36
Figure 9. Showcase of dataset images .....................................................................................40
Figure 10. Difference between inner bounding boxes with rotation and outer bounding boxes
without rotation.........................................................................................................43
Figure 11. Illustration of Gaussian elliptical function over heatmap ............................44
Figure 12. From left to right: original image, produced heatmaps, overlay of both.....................45
Figure 13. Example of Yolo model sizes ...................................................................................45
Figure 14. CNN network architecture made as pytorch nn.Module............................................46
Figure 15. Our CNN network architecture .................................................................................47
Figure 16. Image of chosen edge-device: Raspberry Pi............................................................50
Figure 17. Difference between predicted heatmap (left) and heatmap after threshold mapping
(right)........................................................................................................................51
Figure 18. Process of squaring the results of heatmap after threshold mapping........................51
Figure 19. From top-left to bottom-right: Squared heatmap after applying threshold, squared
heatmap before applying threshold, Replacement of sizes, Final removal of small
bounding boxes........................................................................................................52
Figure 20. From left to right: Overlay of ground-truth bounding boxes with image and overlay of
predicted bounding boxes with image.......................................................................53
Figure 21. Illustration of general effect of varying learning rates................................................57
Figure 22. Showcase when model stopped at sub-optimal solution of outputting all-0 heatmap.
Training/Validation loss stopped decreasing and output heatmap is black ................................58
Figure 23. Effect of different learning rates on training of our network. Graphs describe evolving
training and validation loss over epochs....................................................................................59
Figure 24. Training/validation loss of our final model during training .........................61
Figure 25. Difference in size between actual bounding boxes (left) and predicted bounding
boxes (right) .............................................................................................................61
Figure 26. Confusion matrix of whole testing dataset for 0.9 quantile thresholding....................62
Figure 27. Difference between ground-truth (left) and predictions (right) on image from sunny
video, showcasing problem of our model with generalization ...................................65
CHAPTER 1
Thesis Introduction
1. Introduction
1.1 Background
In recent years, the field of computer vision has witnessed remarkable progress, driven mainly by
the rapid advancements in deep learning techniques (Mokayed & Mohamed, 2014), in applications
such as document classification analysis (Kanchi et al., 2022) and medical studies (Voon et al., 2022).
Among the numerous applications of computer vision, vehicle detection plays a crucial role in
diverse domains, such as intelligent transportation systems (Dilek & Dener, 2023), autonomous
driving, traffic management, surveillance systems (Mokayed et al., 2023), environmental
monitoring (Sasikala et al., 2023; Zhou et al., n.d.), document analysis (Khan et al., 2022;
Mokayed et al., 2014), and many other fields. Accurate and efficient detection of vehicles in real-
world scenarios poses significant challenges due to varying weather conditions, complex
backgrounds, occlusions, and diverse vehicle appearances (Mokayed et al., 2022; Farid et al.,
2023). To address these challenges, researchers have turned to the power of deep learning,
specifically Convolutional Neural Networks (CNNs), as a promising solution.
The ubiquity of vehicles in our daily lives underscores the significance of efficient vehicle
detection systems in ensuring safety, optimizing traffic flow, and enabling intelligent
transportation systems. Traditional approaches to vehicle detection often relied on handcrafted
features and rule-based algorithms, which limited their adaptability to varying environmental
conditions. Moreover, as the complexities of real-world scenarios increased, the demand for more
accurate and sophisticated detection techniques became evident (Xu et al., 2021).
Real-time processing in computer vision means analyzing visual data like images and video frames
instantly without any significant delay. This fast analysis helps make quick decisions based on the
camera data. Real-time processing ensures that vehicles can be detected and tracked accurately
and efficiently, providing valuable information for tasks such as traffic management, autonomous
driving, and surveillance systems (Dilek & Dener, 2023). By processing images or video streams
in real time, computer vision algorithms can quickly identify and classify vehicles, enabling timely
actions to be taken.
Deep learning, a subfield of machine learning, has emerged as a transformative force in computer
vision due to its ability to learn intricate patterns and representations directly from data.
Convolutional Neural Networks (CNNs), particularly, have showcased exceptional performance
in various image recognition tasks, including object detection (Arkin et al., 2023). Leveraging
CNNs for vehicle detection presents a promising solution to overcome conventional methods'
challenges, thus driving advancements in this domain (Arkin et al., 2023).
The quality of images plays a crucial role in how well deep learning Convolutional Neural
Networks (CNNs) perform in image detection tasks (Dong et al., 2016). CNNs rely on high-quality
input images to accurately detect and recognize objects. Ensuring that CNNs are provided with
well-processed, high-quality images is essential for reliable and effective object detection in
practical applications. To ensure the effectiveness and reliability of object detection systems,
studying and improving the high accuracy of detection models is of utmost importance. The
accuracy of models directly affects their ability to identify and classify objects correctly within
images or videos. Object detection models can provide accurate and precise information about the
presence, location, and type of objects by achieving high accuracy.
Detecting and recognizing vehicles in drone images enhances safety in various applications.
Drones have significantly impacted image detection in deep learning by offering a unique aerial
perspective, enabling efficient data collection over large areas. Equipped with high-resolution
cameras, drones provide detailed imagery, facilitating precise object detection and recognition in
aerial images, even in real-time.
However, detecting vehicles in drone-captured images poses challenges, especially when the
images are taken at oblique angles. These challenges include non-uniform illumination,
degradation, blurring, occlusion, and reduced visibility (Mokayed et al., 2023; Mokayed et al.,
2021). Additionally, adverse weather conditions, such as heavy snow, further complicate vehicle
detection in snowy environments due to limited visibility and data interpretation difficulties.
Despite the progress made in object detection in natural images, the application to aerial images
remains distinct and poses unique challenges (Ding et al., 2022). Additionally, there is a lack of
sufficient data and research specifically focused on detecting vehicles in snowy conditions using
images captured by unmanned aerial vehicles (UAVs), highlighting an important research gap in
this field (Mokayed et al., 2023). Addressing these challenges and further investigating vehicle
detection in snowy conditions using drones is crucial to advance the capabilities of deep learning
models and improve safety in various domains.
1.2 Motivation
The lack of sufficient data and research on vehicle detection using drones in snowy conditions
underscores the need for this study. By training on a unique dataset of vehicles captured by UAVs in
diverse snowy weather conditions in the Nordic region, the thesis aims to advance the field. The
goal is to shed light on vehicle detection challenges in snowy conditions using drones and offer
valuable insights to improve the accuracy and reliability of object detection systems in adverse
weather scenarios. Moreover, the research aims to explore edge devices' real-time tracking and
detection capabilities when applied to aerial images under snowy conditions.
Mokayed et al. tackle the complex problem of vehicle detection and recognition in drone images,
explicitly focusing on challenging weather conditions like heavy snow. Their study highlights the
significance of using appropriate training datasets for developing effective detectors, especially in
adverse weather conditions. They provide the scientific community with a unique dataset captured
by unmanned aerial vehicles (UAVs) in the Nordic region, encompassing diverse snowy weather
conditions such as overcast with snowfall, low light, patchy snow cover, high brightness, sunlight,
and fresh snow, along with temperatures below 0 degrees Celsius.
Building upon the findings of Mokayed et al., this thesis aims to further investigate the
performance of object detectors under different winter weather conditions with varying degrees of
snow cover. Additionally, it seeks to evaluate the effectiveness of widely used object detection
methods, such as YOLO v8, YOLO v5, and Faster RCNN, in detecting vehicles in snowy
environments. YOLO and Faster RCNN are some of the most widely used models for object
detection tasks, including aerial vehicle detection. YOLO’s newer versions, specifically v8 and
v5, stand out due to their speed, enhanced detection accuracy, and robustness. On the other hand,
Faster R-CNN, though not as fast as YOLO, still offers relatively quick detection times compared
to the earlier RCNN models, which is vital for real-time applications. Furthermore, the
investigation explores data augmentation techniques to enhance detection performance in
challenging snowy scenarios.
Image detection from aerial images on edge devices is pivotal to this research. Edge devices refer
to computing devices or systems located close to the data source, such as onboard processors on
the drone or specialized hardware integrated into the camera system. By performing image
processing locally on the edge device, the thesis aims to exploit the benefits of reduced latency
and bandwidth optimization, which are crucial in real-time or near-real-time applications. In real-
world applications, due to adverse weather in northern regions, there might be a necessity for "on-
board" detection for real-time results because there might not be a stable connection. This approach
aligns with the intention to improve the efficiency and effectiveness of object detection in aerial
images.
In summary, the motivation behind this thesis lies in addressing the research gap in vehicle
detection under snowy conditions and advancing the field of deep learning-based image detection
in adverse weather scenarios. By leveraging the work of Mokayed et al. and presenting a novel
dataset, this study aspires to contribute to developing more robust and efficient detection
techniques with potential applications in safety, surveillance, and other critical domains.
Moreover, the investigation into edge devices' detection capabilities adds a valuable dimension to
the research, aiming to enhance real-time capabilities and optimize resource utilization in aerial
image processing.
1.3 Problem Definition
As mentioned in the motivation section of this thesis, it aims to address the challenges of vehicle
detection and recognition in drone images under various snowy weather conditions in the Nordic
region on edge devices. By analyzing the performance of existing methods, understanding the
impact of adverse weather, evaluating detection under diverse snowy scenarios, and exploring data
augmentation techniques, the thesis intends to advance the field of deep learning-based image
detection and contribute to safer and more efficient transportation systems and surveillance
applications. Additionally, investigating the application of edge devices for real-time tracking and
detection will further enhance the potential applications of the research findings in real-world
scenarios.
The following research questions will guide the investigation of this thesis:
1. How well do state-of-the-art (SOTA) vehicle detectors operate on hardware-constrained devices required
for real-time scenarios?
2. How does the performance of the introduced methodology in this work compare to existing
state-of-the-art techniques in terms of both accuracy and processing speed?
3. Which of the detection models can be effectively employed for real-time scenarios on the
edge device we have selected?
Research Objectives
Evaluation of Existing Techniques:
The first research question aims to evaluate the effectiveness of widely used vehicle detection
techniques, namely YOLO v8, YOLO v5, and Faster R-CNN, in snowy weather conditions on edge
devices. By comparing the performance of these methods, the thesis seeks to identify which
approach is better suited for detecting vehicles in snowy environments, contributing to the
advancement of vehicle detection technology.
Assessing the comparative performance:
The second research question seeks to quantify how well the proposed methodology identifies
vehicles in snowy weather conditions and where it falls short. This assessment is important for
determining whether the proposed approach represents a significant improvement over existing
techniques in terms of accuracy and processing speed.
Practical application:
The third research question focuses on the practical application of the detection models on chosen
edge devices in real-time scenarios. This question is crucial in determining the feasibility and
usability of the proposed methodology in real-world settings. This information is vital for
understanding the potential impact and usability of the research findings in practical settings,
ultimately contributing to the advancement of vehicle detection technology in snowy conditions.
1.4 Goals and Delimitations
Despite careful and rigorous research design and research questions crafted to explore various
aspects of aerial image detection on edge devices, this thesis is subject to certain limitations.
These can encompass the choice of object detection methods, the
variability of edge devices, and the practical deployment of the proposed techniques in real-world
scenarios.
While this thesis evaluates popular object detection techniques such as YOLO v8, YOLO v5, and
Faster R-CNN, it is essential to acknowledge that the field of computer vision offers a wide array
of algorithms and architectures. By focusing on specific methods, other promising
alternatives may not be thoroughly investigated, which may leave out valuable insights.
The investigation into the real-time tracking and detection capabilities of edge devices
assumes a certain level of uniformity among these devices. However, in practice, edge devices can
exhibit significant variations in terms of processing power, memory, and overall efficiency. These
differences might impact the generalizability of the findings to other edge device configurations.
Although this research concentrates on the technical aspects of vehicle detection using drones and
edge devices in snowy conditions, the practical deployment of these methods in real-world settings
comes with its own set of challenges and operational constraints. The real-world environment can
introduce complexities that extend beyond the scope of this thesis, warranting further consideration
for successful implementation.
Despite these limitations, the thesis contributes valuable insights into aerial image detection on
edge devices, offering potential advancements in object detection systems. By acknowledging
these potential constraints, the research provides a foundation for further investigation and future
developments in the field, paving the way for improved safety, surveillance, and transportation
systems.
1.5 Research Methodology
The research methodology employed in this thesis encompasses several key steps to achieve the
objectives of real-time vehicle detection in snowy conditions using aerial images on edge devices.
The methodology includes data alteration, dataset augmentation, model training and validation,
setting up the edge environment, testing detection models on edge devices, and model evaluation
or benchmarking.
Dataset Collection:
The data collection process is described by Mokayed et al., as the dataset over which the detection
models in this thesis are trained and tested is the NVD from their paper. This step describes how
the data were collected, the features of the data, and the technology used to collect them.
Dataset Alteration for Generating Heatmaps - Labels for Custom CNN:
Heatmaps are generated and used as labels for training a custom Convolutional Neural Network
(CNN) in this step. These heatmaps indicate the presence and location of vehicles within the
images, serving as ground-truth data for the CNN. By incorporating these heatmaps as labels, the
custom CNN is trained to detect and recognize vehicles in snowy conditions effectively.
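As an illustration of this labeling step, a heatmap can be rendered by placing a Gaussian peak at the centre of each annotated bounding box. The sketch below is a simplified, hypothetical version of such a procedure; the function name, the Gaussian form, and the sigma value are illustrative choices, not the thesis's actual implementation.

```python
import numpy as np

def boxes_to_heatmap(boxes, img_h, img_w, sigma=4.0):
    """Render a Gaussian peak at the centre of each bounding box.

    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a float array in [0, 1] of shape (img_h, img_w).
    """
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    heatmap = np.zeros((img_h, img_w), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # overlapping peaks keep the max
    return heatmap
```

The resulting map peaks at 1.0 over each vehicle centre and decays smoothly, which gives the network a dense regression target rather than sparse box coordinates.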
Dataset Augmentation for Training:
To address the challenge of data scarcity, dataset augmentation techniques are employed in certain
models to expand the dataset size and diversity. Augmentation methods may involve random
rotations, flipping, and adjustments in brightness and contrast. These augmentations are intended
to enhance the training dataset to improve the model's ability to generalize during the training of
vehicle detection.
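The augmentations listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline actually used in the experiments: the flip probability and the brightness/contrast ranges are arbitrary choices, and note that geometric transforms must also update the bounding-box labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, boxes):
    """Randomly flip an image horizontally and jitter brightness/contrast.

    image: uint8 array (H, W, 3); boxes: sequence of (x_min, y_min, x_max, y_max).
    Returns the augmented image and boxes adjusted to match.
    """
    h, w = image.shape[:2]
    boxes = np.asarray(boxes, dtype=np.float32).copy()
    if rng.random() < 0.5:                       # horizontal flip
        image = image[:, ::-1]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror x coordinates
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-20, 20)
    image = np.clip(image.astype(np.float32) * contrast + brightness, 0, 255)
    return image.astype(np.uint8), boxes
```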
Model Training and Validation:
The custom CNN, along with other chosen object detection methods like YOLO v8, YOLO v5,
and Faster RCNN, undergoes training on the augmented dataset. The training process involves
optimizing the model's parameters using suitable optimization algorithms and loss functions. To
ensure the models' generalization and prevent overfitting, validation is performed using a separate
validation dataset.
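The train-then-validate cycle described above can be illustrated with a toy model. The sketch below uses plain gradient descent on a logistic regressor with a binary cross-entropy loss, standing in for the far larger CNN training actually performed; the learning rate and epoch count are illustrative. The key point is that validation loss is tracked every epoch so overfitting can be detected.

```python
import numpy as np

def train_with_validation(Xtr, ytr, Xva, yva, lr=0.1, epochs=200):
    """Minimise binary cross-entropy with gradient descent, tracking
    validation loss each epoch so overfitting can be spotted."""
    rng = np.random.default_rng(1)
    w = rng.normal(0, 0.01, Xtr.shape[1])
    b = 0.0
    history = []
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(Xtr @ w + b)))       # forward pass
        grad_w = Xtr.T @ (p - ytr) / len(ytr)      # BCE gradient w.r.t. weights
        grad_b = np.mean(p - ytr)
        w -= lr * grad_w                           # parameter update
        b -= lr * grad_b
        pv = 1 / (1 + np.exp(-(Xva @ w + b)))      # evaluate on held-out data
        val_loss = -np.mean(yva * np.log(pv + 1e-9)
                            + (1 - yva) * np.log(1 - pv + 1e-9))
        history.append(val_loss)
    return w, b, history
```

A rising validation loss while training loss keeps falling is the classic sign of overfitting; with real detectors the same monitoring is done over mAP or detection loss.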
Setting Up the Edge Environment:
In this phase, the edge environment is established to evaluate the models' real-time tracking and
detection capabilities. Setting up the edge environment using a Raspberry Pi is a critical component
of the methodology as it enables evaluating the vehicle detection models in real-world conditions
with limited resources and processing capabilities. It helps ensure that the models are practical,
efficient, and capable of real-time tracking and detection. These are essential factors for successful
deployment in various applications such as drones, surveillance systems, and environmental
monitoring.
Testing the Detection Models on Edge Devices:
The trained detection models are deployed and tested on the edge devices within the established
environment. The models' performance is evaluated regarding their real-time processing
capabilities, accuracy in detecting vehicles under snowy conditions, and resource utilization.
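A simple way to quantify real-time processing capability on such a device is to time the detector over a sequence of frames. In this hypothetical sketch, `model_fn` stands in for whichever trained detector is deployed, and the warm-up count is an arbitrary choice to let caches and lazy initialisation settle before measuring.

```python
import time

def benchmark(model_fn, frames, warmup=3):
    """Average per-frame latency (seconds) and throughput (FPS) of a
    detector callable, discarding warm-up iterations."""
    for frame in frames[:warmup]:
        model_fn(frame)                    # untimed warm-up passes
    start = time.perf_counter()
    for frame in frames[warmup:]:
        model_fn(frame)
    elapsed = time.perf_counter() - start
    n = len(frames) - warmup
    return elapsed / n, n / elapsed
```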
Model Evaluation or Benchmarking:
Evaluation and benchmarking are conducted to assess the detection models' effectiveness and
robustness. The models are tested on a separate test dataset containing real-world drone images
with varying weather conditions. Performance metrics, such as precision, recall, average precision,
and time, are computed to gauge the models' detection accuracy and reliability.
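Precision and recall for detection are computed by matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold. The following is a minimal illustration of that matching (greedy, one-to-one, with an assumed 0.5 threshold), not the exact evaluation code used in the experiments.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground truths."""
    matched = set()
    tp = 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= iou_thr:
                matched.add(i)     # each ground truth may match once
                tp += 1
                break
    fp = len(preds) - tp           # unmatched predictions
    fn = len(gts) - tp             # missed ground truths
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```

Average precision then summarises the precision-recall trade-off as detections are ranked by confidence score.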
1.6 Structure of the Thesis
Chapter 1: Introduction
This chapter serves as an introductory section to the thesis, providing an in-depth overview of the
research study and establishing its contextual framework. The focal subject revolves around
contemporary advancements in vehicle detection techniques under diverse weather conditions,
specifically examining image data acquired through unmanned aerial vehicles.
The problem statements and the research questions/objectives outline the study's aim. This chapter
includes the motivation for why this research is essential and how it contributes to the existing
research on vehicle detection under heavy weather conditions. Finally, any limitations that may
affect the research, the methodology outline, and the organization of the thesis are also part of this
section.
Chapter 2: Literature Review
This chapter provides a comprehensive overview and analysis of the existing literature and
research related to the thesis topic. It aims to demonstrate the authors' knowledge of the field and
the gaps in current knowledge that the thesis intends to address. This section focuses on the
challenges specifically related to using Unmanned Aerial Vehicles (UAVs) for image capture and
analysis like image resolution, motion blur, occlusions, and other factors that affect vehicle
detection using UAV imagery. Here, the literature on vehicle detection under various weather
conditions is examined. Different studies and methods that address weather's impact on detection
algorithms' accuracy are discussed. In this part, various object detection algorithms are reviewed.
This includes deep learning-based approaches like YOLO (You Only Look Once) and Faster R-CNN
(Region-based Convolutional Neural Networks). This section explores different data augmentation
techniques to enhance the training dataset for vehicle detection models. Data augmentation helps
increase the data's diversity, improve model generalization, and handle imbalanced datasets.
Chapter 3: Methodology
This section outlines the specific environment in which the research is conducted. It also explains
the data collection process, including the sources of data and any alterations made to the dataset to
suit the research requirements. It also provides a detailed account of how Unmanned Aerial
Vehicles (UAVs) data was collected. It explains the procedures and equipment used to capture the
images or videos, the flight parameters, and any other relevant details regarding the data
acquisition process. Here, the characteristics and properties of the dataset used in the research are
outlined. It includes information about the size of the dataset, the number of samples, the
distribution of data across different categories or classes, and any specific attributes or labels
associated with the data.
This part delves into the various object detection methods used in the thesis. It provides detailed
explanations of each method, including YOLO v8, YOLO v5, and Faster R-CNN, and how they
function in the research context. The methodology chapter serves as a roadmap for the research,
outlining how the data is collected, how the object detection methods are chosen and implemented,
and how the models are trained and validated. Additionally, it provides details about the study
area, the data sources, and any alterations made to the dataset. The chapter concludes by explaining
how the detection models are tested on edge devices, evaluated, or benchmarked, and how the
edge environment is set up for testing and analysis.
Chapter 4: Analysis, Experiments, and Results
The "Analysis, Experiments, and Results" section in the thesis details the practical aspects of the
research study. It involves conducting experiments to evaluate the performance of different object
detection methods and presenting the findings obtained based on the setup environment on the
chosen edge device.
Various evaluation metrics are introduced in this section to assess the effectiveness of the object
detection methods. These metrics, such as precision, recall, accuracy, and mean average precision
(mAP), are commonly used to measure the methods' performance. This section comprehensively
examines and discusses the individual performances of the three object detection methods.
Each method's results are presented separately to
showcase how well they can detect vehicles under different weather conditions.
Moreover, a comparative analysis is included, where a direct comparison among the three object
detection methods is conducted. This analysis considers the performance metrics, computational
efficiency, and other relevant factors. The objective is to identify the most suitable method for the
specific task. Additionally, the impact of the data augmentation techniques applied in Chapter 3 is
evaluated in this chapter. This evaluation scrutinizes how the data augmentation process influenced
the object detection models' performance and discusses how much it enhanced the overall results.
Chapter 5: Discussion, Conclusions, and Future Work
This chapter addresses the challenges faced during the research process and any limitations that
may have influenced the outcomes. The discussion sheds light on the constraints that might have
impacted the research scope, data collection, or analysis. It also provides insights into potential
areas for improvement or further research.
This section also summarizes the key findings and research outcomes. The chapter
reiterates the main points discussed throughout the thesis and provides a concise statement of the
conclusions drawn from the research. It answers the research questions or objectives outlined at
the beginning of the thesis and reflects on whether they were successfully addressed. It discusses
how the research findings contribute to the existing knowledge in the field and highlights their
potential impact on theory, practice, or real-world applications. This section may also address any
unexpected or noteworthy discoveries made during the study. The chapter outlines areas that could
benefit from further investigation based on the limitations and gaps identified in the current study.
It may propose extensions to the current research, novel approaches, or new directions to explore
in related fields.
References
CHAPTER 2
Literature Review
2. Contextual Framework
2.1 Computer Vision and Vehicle Detection
The genesis of computer vision can be traced back to the 1960s. During this
period, the demand for interpreting and assessing visual data led to the advent of pioneering
techniques, enabling computers to identify and categorize patterns and objects within the images
(Wiley & Lucas, 2018; Mokayed et al., 2021). Central to computer vision are operations such as
image processing, object detection, and pattern recognition. The foundational principle of
computer vision lies in its capability to transform raw images into numerical data, subsequently
utilized as computational inputs. Converting images into numerical data involves several steps,
including image acquisition, preprocessing, feature extraction, and often deep learning-based
methods (Spencer et al., 2019).
This transformation process is enabled through the arrangement of images at the pixel level,
wherein each pixel may be characterized by a numerical value, either in grayscale or as a
combination of numerical values (e.g., 255, 0, 0 in the RGB color model) (Tan & Jiang, 2018).
Grayscale images represent each pixel's brightness level using a single numerical value. The value
typically ranges from 0 (black) to 255 (white), with varying shades of gray in between. Grayscale
is commonly used for simplicity and efficiency in image processing (Tan & Jiang, 2018). RGB
color images, on the other hand, use a combination of three numerical values for each pixel,
representing the intensity of red, green, and blue channels.
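This pixel-level representation can be made concrete in a few lines. The grayscale conversion below uses the common ITU-R BT.601 luminance weights as one example of collapsing the three RGB channels into a single value per pixel.

```python
import numpy as np

# A 2x2 RGB image: each pixel holds three channel intensities in 0-255.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Weighted sum of the channels yields one brightness value per pixel.
weights = np.array([0.299, 0.587, 0.114])       # BT.601 luminance weights
gray = (rgb.astype(np.float64) @ weights).round().astype(np.uint8)
```

Here the white pixel maps to 255 and the pure-red pixel to roughly 76, reflecting the lower perceived brightness of red.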
As a sub-discipline of artificial intelligence, computer vision facilitates the capacity of machines
to decode and act upon information derived from visual sources like photos, videos, and other
image-based inputs (Sharma et al., 2021).
One particular area of interest is vehicle detection, which involves identifying and tracking
vehicles in real-time using computer vision techniques.
Several vehicle detection approaches have been proposed, including feature-based, appearance-
based, and deep learning-based methods. Feature-based methods rely on handcrafted features such
as edges, corners, and color histograms to detect vehicles, while appearance-based methods use
templates or classifiers trained on vehicle images to detect them (Nigam et al., 2023). On the other
hand, deep learning-based methods use neural networks to learn representations of vehicle features
directly from raw image data (Gupta et al., 2021; Chuangju, 2023).
With the advancements in technology and the increase in traffic on roads, vehicle detection has
become an essential part of road safety. The most current and relevant vehicle detection methods
include cameras, drones, and LiDAR technology. Each method has strengths and limitations, and
the most effective method depends on the specific application and environment.
Cameras are one of the most common methods for vehicle detection. They can be mounted on the
road or vehicles themselves and capture images of the road. Computer vision algorithms are then
used to detect vehicles (Quang Dinh et al., 2020). While this method is effective in clear weather
conditions, it can struggle in heavy rain, snow, or bright sunlight.
Drones are another method for vehicle detection that has gained popularity in recent years. Drones
equipped with cameras can capture images of the road from above and use computer vision
algorithms to detect vehicles (Wang et al., 2022). This method effectively detects vehicles in
different environments, such as parking lots or highways. However, it may not be practical in some
situations, such as the interference of buildings in densely populated urban areas, adverse weather
conditions, or even in areas with limited or no internet connectivity.
LiDAR technology is another method for vehicle detection that has gained popularity in recent
years. It uses laser beams to detect objects and can create a 3D map of the environment (Zhou et
al., 2022; Khan et al., 2022). Like drones, this method is effective in detecting vehicles in different
environments. However, it may be expensive or impractical in some situations.
Radar is another method for detecting vehicles that uses radio waves to detect objects. This method
is effective in all weather conditions but may have limitations in detecting smaller vehicles such
as motorcycles (Li et al., 2022).
Machine learning algorithms are also developed for vehicle detection. These algorithms can learn
to recognize different types of vehicles and adapt to different environments. They are often
combined with cameras, drones, and LiDAR technology to improve vehicle detection accuracy
(Khan et al., 2022). One of the benefits of machine learning algorithms is their ability to adapt to
changing environments. They can learn to recognize different weather conditions, lighting
conditions, and road conditions and adjust their detection methods accordingly. This makes them
particularly effective in areas where conditions may change frequently, such as highways or
construction zones. Another benefit of machine learning algorithms is their ability to detect
partially obscured or difficult-to-see vehicles. For example, they can detect vehicles that are
partially hidden by trees or buildings or that are moving quickly through a crowded area. This can
improve road safety by alerting drivers to potential hazards that they may not have noticed
otherwise. However, there are also some limitations to machine learning algorithms. They require
large amounts of data for training, which can be time-consuming and expensive. Additionally, they
may not be as accurate as other vehicle detection methods in certain situations. For example, they
may struggle to identify vehicles that are very similar in appearance, such as two identical cars
parked next to each other.
Two popular deep learning-based approaches for vehicle detection are the You Only Look Once
(YOLO) algorithm and the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm.
YOLO utilizes a single neural network to perform object detection and classification in real-time,
while the Faster R-CNN algorithm uses a two-stage process: first proposing regions of interest and
then classifying them as either vehicles or non-vehicles (Maity et al., 2021).
Several datasets have been established to develop and evaluate vehicle detection algorithms, which
offer a diverse range of real-world scenarios and challenging conditions to test the strength and
accuracy of vehicle detection algorithms. Despite recent advancements, challenges persist in
improving the accuracy and speed of detection in complex environments and adverse weather
conditions.
Some of the most common challenges include overfitting and underfitting, the limitation of labeled
training data, ineffective utilization of hierarchical features, dependency on field experts for
labeling, and limitations inherent to transfer learning (Alzubaidi et al., 2021). Overfitting, a
phenomenon where the CNN model learns the training dataset so well that it is unable to
generalize to new data, typically occurs when training on small datasets or utilizing fewer layers
in the model (Gallagher, 2022). Its counterpart, underfitting, arises when the model lacks the
complexity to discern the necessary features critical for object detection, thereby impeding
accuracy and generalization (Baheti, 2022). Finding the right balance between the complexity of
the model and the amount of training data is key to avoiding these problems.
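One standard guard that complements this balance is early stopping: training is halted once validation loss stops improving. A minimal sketch follows, where the `patience` value is an illustrative choice rather than a recommended setting.

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop: the first epoch
    at which validation loss has failed to improve for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best; reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for too long
    return len(val_losses) - 1               # trained to the final epoch
```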
The architecture of CNN models, especially those with fewer layers, worsens the
problem because such models cannot properly exploit the hierarchical features present in large
datasets. This leads to poor results in object detection tasks. It is essential, therefore, to
develop CNN models that can make good use of these hierarchical features to boost their
effectiveness significantly.
Lastly, although transfer learning has become a powerful tool to overcome these issues, it still has
downsides. A major problem is the mismatch between the type of data used in transfer learning
and the data in the target dataset. This can reduce the accuracy and effectiveness of
CNN models in detecting objects across various tasks.
As we stand on the cusp of autonomous vehicle development, the necessity for real-time object
detection algorithms that can function in dynamic and intricate settings is paramount. This
demands sophisticated algorithms capable of learning from substantial amounts of meticulously
labeled data and high-performance computing hardware to facilitate real-time processing.
2.2 Deep Learning Techniques for Vehicle Detection
Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of deep learning model in the artificial intelligence family
commonly used for various computer vision tasks, including image classification, object detection,
and image segmentation (Datagen, n.d.). CNNs are specifically designed to process and analyze
visual data, making them highly effective in tasks involving images and visual patterns (Taye,
2023). CNNs are designed explicitly for grid-like data, such as images, where the arrangement of
pixels has spatial significance. A CNN learns patterns in an image by exploiting the dependencies
between neighboring pixels.
Figure 1. Machine Learning Family (Under the Hood of Machine Learning - The Basics, n.d.)
A CNN comprises an input layer, an output layer, and many hidden layers in between. These layers
perform operations that transform the data to learn features specific to it. Some fundamental
concepts in CNNs for object detection (“Fundamental Concepts of Convolutional Neural
Network,” 2019) are:
a. Convolutional Layers: These layers use filters to scan an image and identify specific patterns or
features, such as edges or textures. The convolutional layers slide small filters over the input image,
computing dot products to detect patterns. Each filter specializes in recognizing a specific feature.
b. Activation Function: After convolution, an activation function is applied to the resulting feature
maps, introducing non-linearities that enhance the network's ability to capture complex
relationships. A commonly used activation function is ReLU, which keeps positive values and sets
negative values to zero.
c. Pooling Layers: Pooling layers reduce the spatial dimensions of the features by summarizing
the information in small regions while preserving the essential features.
d. Fully Connected Layer: The fully connected layer is typically found at the end of a CNN and is
responsible for classification. It takes input from the previous layers and generates the final output,
assigning probabilities to different object classes.
e. Region-based CNN (R-CNN): R-CNN is one of the first CNN models designed specifically for
object detection. Rather than exhaustively sliding a window over the image, it classifies a set of
category-independent region proposals. The architecture consists of three modules: region
proposal extraction, affine image warping, and category-specific classification using SVMs.
f. Mask R-CNN: Mask R-CNN is an extension of Faster R-CNN that also predicts pixel-level
segmentation masks for objects. It uses a feature pyramid network (FPN) and a bottom-up pathway
to improve the extraction of low-level features. The architecture includes three branches for
bounding box prediction, object class prediction, and mask generation.
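The core layer operations above (convolution, ReLU activation, and pooling) can be sketched in plain NumPy. This is a simplified single-channel illustration with a hand-picked edge filter, not a trained CNN:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a filter over the image, computing dot products (valid padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Keep positive values, set negative values to zero."""
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Summarize each size x size region by its maximum, shrinking spatial dimensions."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # a simple vertical-edge detector
features = max_pool(relu(conv2d(image, edge_filter)))
print(features.shape)  # (2, 2)
```

Stacking many such filtered, activated, and pooled maps is what lets deeper layers build complex features out of simple ones.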
These fundamental concepts form the building blocks of CNN models for object detection. They
enable the network to extract relevant features, classify objects, and generate accurate predictions.
During training, CNNs adjust their parameters (weights and biases) to minimize the difference
between the predicted class probabilities and the actual labels. This optimization process, called
backpropagation, iteratively updates the parameters to improve the network's accuracy.
In object detection models, CNN layers are utilized to extract features from input images, where
earlier layers detect simple patterns and deeper layers capture more complex features (Szegedy et
al., 2015). Incorporating CNNs into detection models has enabled real-time object detection with
architectures like YOLO (You Only Look Once) that process an entire image with a single CNN
(Redmon et al., 2016).
YOLO
YOLO is a real-time object detection algorithm introduced in 2015. It frames detection as a
regression problem rather than a classification pipeline: a single convolutional neural network
predicts spatially separated bounding boxes and associates class probabilities with
each detected object (Jocher et al., 2022). The architecture of YOLO involves resizing the input
image, applying convolutions, and using techniques like batch normalization and dropout to
improve performance. YOLO has evolved over the years, with versions from v1 to v8 addressing
limitations such as detecting smaller objects and unusual shapes. It has gained popularity due to
its accuracy, generalization capabilities, and being open source. In this research, we have decided
to proceed with YOLO v5 and YOLOv8.
The Starting Point
The first version of YOLO, released in 2015, was a game-changer for object detection. However,
it had limitations, such as struggling to detect small objects that appear in groups and difficulty
generalizing to objects with new or unusual aspect ratios.
Incremental Improvement
YOLOv2, also known as YOLO9000, introduced a new backbone architecture called Darknet-19,
which improved the accuracy and speed of object detection. It also used logistic regression for
better bounding box prediction and introduced independent logistic classifiers for more accurate
class predictions.
Further Enhancements
YOLOv3 built upon YOLOv2 by performing predictions at three different scales, allowing for
better semantic information and higher-quality output. It achieved a strong balance of speed and
accuracy compared to previous versions and other state-of-the-art object detectors.
Optimal Speed and Accuracy
YOLOv4 outperformed YOLOv3 in both speed and accuracy, making it the most advanced version
of YOLO at the time of its release and achieving an optimal speed-accuracy trade-off compared to
previous versions and other state-of-the-art object detectors.
YOLOv5, developed by Ultralytics, moved the framework to a native PyTorch implementation
and focused on ease of training and deployment, achieving strong accuracy and speed on the
COCO dataset.
Hardware-Friendly Design
YOLOv6, developed by Meituan, a Chinese e-commerce company, targeted industrial
applications. It introduced a hardware-friendly backbone and neck design and an efficient
decoupled head, further enhancing accuracy and speed.
Trainable Bag-of-Freebies
YOLOv7 aimed to improve detection accuracy without increasing training costs. It focused on
increasing both inference speed and detection accuracy, making it a significant improvement over
previous versions.
The Pinnacle of Efficiency and Accuracy
Using the improvements made in YOLOv7 as a foundation, YOLOv8 aims to be the most capable
version yet at detecting objects in images and video, pushing speed and accuracy further. A major
advantage is that it works well alongside all the older YOLO versions, making it simple for users
to try out different versions and see which one works best, a top option for those who want the
newest YOLO tools while still using their older ones (Spodarets & Rodriguez, 2023).
Figure 2. YOLO evolution (A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond, 2023)
Faster R-CNN
Faster R-CNN was introduced by Ren et al. (2017) in their paper “Faster R-CNN: Towards Real-
Time Object Detection with Region Proposal Networks”. It was developed to solve the computational
bottleneck in object detection systems caused by region proposal algorithms. Faster R-CNN is an
object detection algorithm that is widely used in computer vision tasks. It consists of two modules:
a deep, fully convolutional network that proposes regions and a Fast R-CNN detector that uses the
proposed regions. The Region Proposal Network (RPN) is a fully convolutional network that
predicts object bounds and objectness scores at each position. The RPN is trained to generate high-
quality region proposals, which are then used by the Fast R-CNN detector for object detection.
The RPN in Faster R-CNN enables nearly cost-free region proposals by sharing convolutional
features with the detection network. This reduces the computational bottleneck of region proposal
computation and allows the system to run at near real-time frame rates. At a more detailed level,
the backbone network, which in this case is ResNet, is used to extract features from the image.
These features are then fed into a Region Proposal Network (RPN) that generates region proposals,
potentially bounding box locations for objects (Yelisetty, 2020). Finally, a Box Head is used to
refine and tighten the bounding boxes.
Features of Faster R-CNN are the Region Proposal Network (RPN), unified network, efficient
region proposals, and improved localization accuracy. The RPN in Faster R-CNN significantly
improves the localization accuracy of object detection, especially at higher Intersection over Union
(IoU) thresholds. It achieves higher mean Average Precision (mAP) scores compared to the Fast
R-CNN system.
Figure 3. Faster R-CNN architecture (KHAZRI, n.d.)
Detectron2
Detectron2 is an object detection library developed by the Facebook AI Research (FAIR) team. It is
a complete rewrite of the original Detectron, which was released in 2018. The library is built on
PyTorch and offers a modular and flexible design, allowing for fast training on single or multiple
GPU servers (Wu, 2019). Detectron2 supports a wide range of object detection models. These
models are pre-trained on the COCO Dataset and can be fine-tuned with custom datasets. Some of
the detection models available in the Detectron2 model zoo include Fast R-CNN, Faster R-CNN,
and Mask R-CNN. These models are designed to detect and classify objects in images or videos.
Detectron2 also offers features like network quantization and model conversion for optimized
deployment on cloud and mobile platforms (Yelisetty, 2020). With its speed, scalability, and
extensive model support, Detectron2 is a powerful tool for training object detection models. It
aims to advance machine learning by providing speedy training and addressing the challenges
faced when transitioning from research to production.
Detectron2 leverages the power of Faster R-CNN by providing a modular and extensible design.
This means that users can easily plug in custom module implementations into different parts of the
object detection system (Wu, 2019). This allows for easy experimentation and customization of
the models.
Important Object Detection Notions
YOLO Model Sizes
The YOLO (You Only Look Once) object detection model comes in different sizes for various use
cases. YOLOv5, for example, offers several model sizes, typically referred to as small, medium,
large, and extra-large, allowing users to choose the one that suits their specific requirements. The
size of the model influences its speed and accuracy: smaller models are faster but may not be as
accurate, while larger models offer better accuracy but may be slower (Jocher, 2021).
Loss Function
In the YOLO object detection system, the loss function is a composite function that measures the
differences between the predicted and true values for objectness, class predictions, and
bounding box coordinates. The loss function is designed to weigh errors in bounding box
coordinates and object detection more heavily than errors in class predictions. Generally, it
integrates mean squared error for bounding box predictions and logistic regression loss for class
predictions and objectness (Redmon et al., 2016).
In Faster R-CNN, the loss function, often referred to as multi-task loss, is a unified objective
function that merges the classification loss (usually log loss) and the bounding box regression loss
(usually smooth L1 loss). This approach allows the model to learn to classify objects and predict
bounding box coordinates concurrently, facilitating more accurate object detection (Ren et al.,
2017).
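The two loss styles described above can be sketched as follows. These are generic textbook formulations of smooth L1 (bounding-box regression) and binary log loss (classification/objectness), not the exact implementations used by YOLO or Faster R-CNN:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).sum()

def log_loss(p, y, eps=1e-12):
    """Binary log loss between predicted probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# A predicted box vs. a ground-truth box (x, y, w, h): small errors stay
# quadratic, while the large height error falls in the linear regime.
pred_box = np.array([0.5, 0.5, 2.0, 2.0])
target_box = np.array([0.6, 0.4, 2.0, 1.0])
print(smooth_l1(pred_box, target_box))  # ≈ 0.51
```

The quadratic region keeps gradients small near the target, while the linear region prevents outliers from dominating the total loss.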
ReLU
ReLU is an activation function defined as f(x) = max(0, x). It outputs the
input directly if the input is positive; otherwise, it outputs zero. It has become popular for its
computational efficiency and its ability to mitigate the vanishing gradient problem to some extent
(ReLu Definition, n.d.).
He Weight Initialization for ReLU
He Weight Initialization is a weight initialization technique specifically designed for ReLU
activation functions. It addresses the issue of poor convergence in deep networks that can occur
with other initialization methods (He et al., 2015). He et al. (2015) proposed this technique, which
initializes the weights of ReLU layers with a variance of Var(W) = 2/n_in, where n_in is the number
of inputs to the layer. Adopting this approach keeps the input and output gradient variances
consistent, thereby enhancing network convergence and operational efficacy. Typically,
prior to initiating network training, the ReLU layer weights are adjusted in alignment with this
variance formula. By initializing the weights in this way, the network is better able to learn and
optimize the ReLU activation function, leading to improved performance (Aguirre & Fuentes,
2019). Since in this technique the weights are set to numbers that are not too small or too big, it
helps to prevent the vanishing gradient problem. It ensures that the variances in the activations of
hidden units do not increase with depth, leading to better convergence in deep networks (He et al.,
2015). YOLO and Faster R-CNN, which are utilized in this thesis, are prone to issues related to
vanishing or exploding gradients, and both often employ ReLU (Rectified Linear Unit) activations
in their architectures. Proper weight initialization, such as He initialization, can keep gradients
in a manageable range, leading to faster convergence during training.
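A minimal sketch of He initialization for a dense layer with n_in inputs; the sampled variance should sit close to the target 2/n_in:

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """He initialization: draw weights with variance 2 / n_in, suited to ReLU layers."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 256)
print(W.var())  # close to 2 / 512 ≈ 0.0039
```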
Vanishing gradient problem
The vanishing gradient problem refers to the phenomenon where gradients of the loss function
become too small for the network to learn effectively, a problem often encountered in deep neural
networks with many layers. This occurs because the derivatives of the activation functions used in
the network become very small, causing the gradients to diminish as they are backpropagated
through the layers. This problem can impede the convergence of the network during training and
is one of the motivations for the development of alternative activation functions like ReLU (What
Is the Vanishing Gradient Problem? n.d.).
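A toy numerical illustration of why this happens: the derivative of the sigmoid is at most 0.25, so multiplying it across many layers during backpropagation drives the gradient toward zero, whereas ReLU's unit derivative for positive inputs avoids this shrinkage:

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Gradient magnitude after backpropagating through 20 sigmoid layers
# (activations near zero, the best case for the sigmoid).
grad = 1.0
for _ in range(20):
    grad *= sigmoid_grad(0.0)  # multiplies in 0.25 per layer
print(grad)  # 0.25**20 ≈ 9.1e-13, effectively vanished

# ReLU's derivative is 1 for positive inputs, so the product does not shrink.
relu_grad = 1.0 ** 20
print(relu_grad)  # 1.0
```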
Spline Wavelet Transforms
Wavelet transforms are mathematical operations used for signal analysis and data compression.
They can be used for object detection by analyzing the frequency content of an image or signal.
The wavelet transform decomposes the image or signals into different frequency components,
allowing for the identification of specific patterns or features (Wavelet Transforms — GSL 2.7
Documentation, n.d.).
Wavelet transform is commonly used in deep learning architectures, specifically in convolutional
neural networks (CNNs). It can be integrated into CNN architectures to improve noise robustness
and classification accuracy. By replacing traditional down-sampling operations with wavelet
transform, the low-frequency component of the feature maps, which contains important object
structures, is preserved while the high-frequency components, which often contain noise, are
dropped (Li et al., 2020).
Based on the documentation of wavelet transforms (Wavelet Transforms — GSL 2.7
Documentation, n.d.), there are two approaches to how it can be used for object detection. One
approach is to apply the wavelet transform to the image or signal and then analyze the resulting
coefficients. By selecting the largest coefficients, which represent the most significant frequency
components, objects or features of interest can be identified. The remaining coefficients can be set
to zero, effectively compressing the data while retaining the important information.
Another approach is to use the wavelet transform as a feature extraction technique. By applying
the wavelet transform to different regions of an image or signal, features such as edges, textures,
or shapes can be extracted. These features can then be used for object detection or classification.
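A minimal one-level 2-D Haar decomposition in NumPy illustrates the idea. This hand-rolled version, built from simple pairwise averaging and differencing, is a stand-in for library routines (e.g., in GSL or PyWavelets):

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar wavelet transform via pairwise averaging/differencing."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages (low frequency)
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences (high frequency)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0      # low-low: coarse object structure
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0      # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0      # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal detail, often mostly noise
    return LL, LH, HL, HH

img = np.random.default_rng(0).random((8, 8))
LL, LH, HL, HH = haar_dwt2(img)
print(LL.shape)  # (4, 4), a half-resolution summary of the image
```

Replacing a down-sampling step with LL keeps the coarse structure, while dropping HH discards much of the high-frequency noise, which is the substitution described above.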
Heatmaps and Bounding Boxes
Some important notions that are used throughout this research are heatmaps and bounding boxes.
In object detection, combining heatmaps and bounding boxes often involves utilizing heatmaps to
identify the central points or key points of objects and then using additional geometric attributes
to construct bounding boxes around identified objects. Using heatmaps, the CNN can accurately
and efficiently locate and track objects in an image, improving its performance over time by having
clear indicators of the objects' positions (Haq et al., 2022).
A neural network is used to generate a heatmap where the intensity of each pixel represents the
likelihood of that pixel being a central point or a key point of an object. Heatmaps can represent
additional attributes like object width, length, or even orientation. Then, a peak detection algorithm
is used to detect peaks in the heatmaps indicating an object's center. In this research, the technique
of thresholding is used for such detections. It involves selecting a threshold value, and then setting
all pixel values less than the threshold to zero and all pixel values greater than or equal to the
threshold to one (or some other specified value). This essentially divides the image into regions of
interest based on the intensity values of the pixels. From the detected central points, the network
predicts the other attributes that are needed to create the bounding boxes.
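The thresholding step described above can be sketched as follows (illustrative values, not the actual experimental data):

```python
import numpy as np

def threshold_peaks(heatmap, thresh=0.5):
    """Binarize a heatmap: pixels >= thresh become 1, all others 0."""
    return (heatmap >= thresh).astype(np.uint8)

heatmap = np.array([[0.1, 0.2, 0.1],
                    [0.2, 0.9, 0.3],
                    [0.1, 0.3, 0.1]])
mask = threshold_peaks(heatmap, 0.5)
centers = np.argwhere(mask)  # coordinates of candidate object centers
print(centers)  # [[1 1]]
```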
Haq et al. utilize the heatmap technique for object detection and tracking. The authors use a fully
convolutional neural network to generate heatmaps representing each image's desired
output. These heatmaps are created by placing a Gaussian distribution around the object's location
coordinates obtained from a text file. The purpose of these heatmaps is to provide feedback to the
CNN and improve its performance by indicating how well it is currently detecting and tracking
the object.
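A sketch of building such a Gaussian target heatmap. The function below illustrates the general technique of placing a Gaussian around the object's location; it is not Haq et al.'s actual code:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Target heatmap with a Gaussian bump centred on the object's (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 64, cx=20, cy=30)
print(hm.max(), np.unravel_index(hm.argmax(), hm.shape))  # 1.0 at (30, 20)
```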
Combining heatmaps and bounding boxes in this way allows for a more nuanced approach to
object detection, leveraging both the localization information provided by heatmaps and the
classificatory power of bounding boxes to achieve accurate object detection (Xu et al., n.d.).
Xu et al. propose another method that combines heatmaps and bounding boxes to improve
detection performance in videos. Their method, called heatmap propagation, generates
a propagation heatmap by overlapping the heatmaps of detected objects from consecutive frames.
This propagation heatmap highlights the potential positions of each object's center with its
confidence score. Then, the paper combines the network output heatmap of the current frame with
the propagation heatmap to create a long-term heatmap. This long-term heatmap serves as a
temporal average and helps to compensate for low image quality and maintain stable detection.
The paper calculates a weighted size for bounding box prediction by combining the propagated
size information and the network output size. This weighted size is used to generate the final
bounding box prediction.
2.3 Aerial Image Detection Challenges
Aerial images capture the Earth’s surface from above, at various elevation levels and steep
viewing angles. They are typically obtained from aircraft, drones, satellites, or other
airborne platforms. Due to the level of detail they provide over a large spatial scale, they have
found use in many scientific fields. Integrating AI technologies like deep learning and computer
vision with aerial imagery analysis has opened a new avenue in academic research, facilitating
more detailed and sophisticated analyses and predictions.
As with any other new technology, aerial images come with their own challenges, which are
subject to research. Identified challenges include arbitrary orientations, scale variations,
nonuniform object densities, and large aspect ratios (Ding et al., 2022). Capturing objects from
different angles and orientations due to the overhead view makes object detection more
challenging. Due to the variations in altitude, aerial images can vary significantly in size, which
brings the need for handling scale variations. Varying densities of objects across regions add to
the complexity of object detection, as do objects with large aspect ratios, such as the long,
narrow shapes of ships or vehicles.
Another set of challenges that must be overcome to ensure accuracy and reliability concerns
obtaining high-resolution imagery. Harsh weather conditions or buildings can obscure parts of the
image, and lighting and shadows can make it difficult to distinguish objects from their
surroundings (Lu et al., 2023; Liu et al., 2023). Another challenge is the need
for real-time object detection, which requires fast and efficient algorithms (He et al., 2019). This
can be especially challenging when dealing with large datasets or complex environments. To
overcome these challenges, researchers are developing new algorithms and techniques that are
specifically designed for object detection in aerial images. These include deep learning algorithms,
feature extraction techniques, and advanced image processing methods.
Ding et al. address these challenges and limitations of object detection in aerial images. The paper
focuses on the lack of large-scale benchmarks for object detection in aerial images and presents a
comprehensive dataset called DOTA (Dataset of Object Detection in Aerial images), which
contains over 1.7 million instances annotated by oriented bounding boxes (OBBs). The paper also
provides baselines for evaluating different algorithms and a code library for object detection in
aerial images. The research aims to facilitate the development of robust algorithms and
reproducible research in this field. The method used in this paper is a combination of CNN OBB
and RoI (region of interest) Transformer. It evaluates ten algorithms, including Faster R-CNN and
over 70 models with different configurations using the DOTA-v2.0 dataset. It achieves an OBB
mAP (mean average precision) of 73.76 for the DOTA-v1.0 dataset. The method outperforms
previous state-of-the-art methods, except for the one proposed by Li et al., 2022. It also
incorporates rotation data augmentation during training and shows significant improvement in
densely packed small instances.
2.4 Real-Time Tracking and Detection on Edge Devices
"Edge devices" are hardware units that manage the transmission of data at the junction of two
networks, acting as gateways that moderate the flow of data (Posey & Scarpati, n.d.). These
devices, which range from routers to multiplexers and integrated access devices, function to
streamline the communication between distinct networks. Training a CNN on edge devices refers
to using these devices to perform the complex tasks needed to train a CNN model. In the past, deep
learning models such as CNNs were typically trained in large data centers that had powerful GPUs.
However, due to the increasing need for immediate, on-device processing and the quick
advancements in hardware technology, there has been a growing preference to train these models
directly on edge devices (Ran, 2019).
Edge devices are important for allowing quick, real-time processing in object detection because
they are placed at key points where different networks meet. They handle data and do computing
jobs near where the data is coming from (like IoT devices) instead of using big, central data centers
(What Is Edge Computing, Architecture Difference and How Does Edge Computing Work? 2023).
The emergence of edge computing has brought about several advantages that are critical for the
applications discussed here, as described by Liang et al. (2022) and Amanatidis et al. (2023).
One of the key advantages of edge computing is its ability to minimize latency (Exploring the
Potential of Edge Computing: The Future of Data Processing and IoT, 2023). With edge
computing, data no longer needs to travel to a distant server for processing, which is particularly
useful for time-sensitive applications such as autonomous driving or real-time video analysis. This
is because the processing takes place closer to the data source, leading to faster response times and
greater efficiency.
Another significant advantage of edge computing is its ability to enhance privacy. Edge computing
can benefit applications dealing with sensitive data by keeping data on the device where it was
generated. This includes applications such as healthcare and finance, where data privacy is of
utmost importance. By ensuring that data remains on the device, edge computing can effectively
reduce the risk of data breaches and unauthorized access, which may occur during data
transmission.
Edge computing can also help reduce network load. With the processing taking place on the device
itself, only relevant data needs to be transmitted, thereby reducing the amount of data that needs
to be sent over the network. This can lead to more efficient use of network resources, which can
be particularly useful in situations where network capacity is limited.
Finally, edge computing can enhance operability. Since the system can continue to operate even
when network connectivity is lost or unreliable, applications can continue to function even in
difficult network conditions. This can be critical for applications that require constant uptime, such
as those used in industrial settings.
However, training deep learning models on edge devices also comes with challenges. These
devices often have limited computational resources and power, so models must be efficient and
lightweight. This has sparked interest in areas such as model compression, network pruning, and
efficient neural architecture search (Falk, 2022).
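As an illustration of one such technique, the sketch below applies magnitude-based weight pruning, zeroing out the smallest weights to produce a sparser, lighter model. This is a simplified example of the idea, not a production compression pipeline:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights, keeping a (1 - sparsity) fraction."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)          # number of weights to remove
    if k == 0:
        return weights.copy()
    cutoff = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= cutoff] = 0.0
    return pruned

W = np.random.default_rng(0).normal(size=(4, 4))
Wp = prune_by_magnitude(W, sparsity=0.5)
print((Wp == 0).mean())  # 0.5
```

Sparse weight matrices can then be stored and multiplied more cheaply, which is exactly what resource-constrained edge hardware needs.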
2.5 Evaluation Metrics and Benchmarking
Evaluation metrics and benchmarking in object detection are processes that work hand in hand to
assess and compare the performance of various object detection algorithms or models.
Evaluation metrics provide the quantitative foundation upon which benchmarking is carried out.
These metrics, which include parameters such as precision, recall, and mean average precision
(mAP), offer a detailed insight into the individual performance of a model in terms of its accuracy,
ability to detect objects under varying conditions, and efficiency in distinguishing between
different objects within a given dataset. By assessing models based on these metrics, researchers
and developers can understand the strengths and weaknesses of a model in a quantitative manner.
When assessing the effectiveness of various models, benchmarking is the preferred technique. This
methodology considers the evaluation metrics and entails a comparative examination of multiple
models to ascertain which one performs optimally in specific circumstances. Through
benchmarking, one can pinpoint the most appropriate model for a given task and fine-tune its
performance accordingly. This process helps in identifying the best-performing models based on
real-world testing scenarios and datasets and aids in fine-tuning existing models or developing
new models that are more efficient and accurate.
However, as Padilla et al. (2021) highlight in their paper, there are challenges with the benchmark
and evaluation metrics, which vary significantly across different studies and can sometimes result
in confusing and potentially misleading comparisons. The identified challenges are differences in
bounding box representation formats and variations in performance assessment tools and metrics.
They further support this argument with the varying formats that different object detectors may
use to represent bounding boxes, such as absolute coordinates or relative coordinates
normalized by the image size. This is important because, later in the methodology of this research,
bounding boxes will be converted to fit the YOLO and Detectron frameworks.
These variations in representation formats can make it difficult to compare and evaluate the
performance of different detectors.
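As an illustration of such a conversion, the sketch below maps absolute corner coordinates (x_min, y_min, x_max, y_max) to YOLO's normalized centre/width/height format. This is a generic formulation of the common conventions, not any specific tool's code:

```python
def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Absolute corner box -> YOLO format: centre and size normalized by image dims."""
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return x_c, y_c, w, h

# A 200x200 box in a 1000x800 image becomes fractions of the image size.
print(corners_to_yolo(100, 200, 300, 400, img_w=1000, img_h=800))
# (0.2, 0.375, 0.2, 0.25)
```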
Regarding the variations in performance assessment tools and metrics, Padilla et al. (2021) state
that each performance assessment tool implements a different set of metrics, requiring specific formats
for the ground-truth and detected bounding boxes. While there are tools available to convert
annotated boxes from one format to another, the lack of a tool compatible with different bounding
box formats and multiple metrics can lead to confusion and misleading comparative assessments.
To overcome the challenges in the benchmarking of datasets and evaluation metrics, the authors
have proposed the development and release of an open-source toolkit for object detection metrics
(Padilla et al., 2021). The toolkit includes various evaluation metrics, supports different bounding
box formats, and introduces a novel metric for video object detection. The metric considers the
motion speed of the objects and measures the average Intersection over Union (IOU) score between
the ground-truth objects and the detected objects in the current frame and nearby frames. This
metric provides a more comprehensive evaluation of video object detection performance compared
to traditional metrics like mean Average Precision (mAP).
Some of the evaluation metrics considered in their study are average precision, intersection over
union, mean average precision and average recall.
Definitions of the evaluation metrics in deep learning are:
● Average Precision (AP) measures the precision of object detection at different levels of
recall. It calculates the area under the precision-recall curve and provides an overall
measure of detection accuracy (Anwar, 2022). To compute the area under the Precision-
Recall curve, the curve is divided into smaller segments, typically trapezoids. The area of
each segment, which forms a trapezoid between two adjacent points, is calculated. The
total area under the curve is obtained by summing the areas of all segments. Finally, to
derive the Average Precision, the total area is divided by the total range of recall. This
normalization ensures that the metric is not dependent on the specific recall range of the
dataset.
Figure 4. Illustration of precision-recall curve (Phung,2020)
The resulting value provides an average assessment of how well the system performs across
various thresholds. It offers a comprehensive measure of the system's efficiency in ranking
relevant items (Precision-Recall — Scikit-Learn 1.3.1 Documentation, 2011).
Precision is the proportion of correctly predicted positive instances out of all
instances predicted as positive. The formula for precision is:

Precision = TP / (TP + FP)

On the other hand, recall measures the proportion of correctly predicted positive
instances out of all actual positive instances. The formula for recall is:

Recall = TP / (TP + FN)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
The selection of Average Precision (AP) as the assessment metric in this thesis stems from its
widespread adoption and efficacy in appraising object detection performance. AP offers a
comprehensive evaluation of the model's capability to detect objects under varying conditions by
considering performance at different detection thresholds (Padilla et al., 2020). Furthermore, AP
grants recognition for situations in which a model identifies only a portion of an object. It merges
precision (the correctness of positive predictions) and recall (the proficiency to locate all pertinent
instances) into one metric. This holistic assessment proves vital in object detection scenarios where
both incorrect positive identifications and overlooked instances hold substantial importance.
Effective object detection mandates not only accurate categorization but also exact localization of
objects. Average Precision (AP) takes both factors into consideration, making sure that the model's
proficiency in accurately placing bounding boxes is evaluated (Garcia-Garcia et al., 2017).
● Intersection over Union (IOU) measures the overlap between the predicted bounding box
and the ground-truth bounding box (Anwar, 2022; Shah, 2022). It indicates how well the
predicted bounding box lines up with the actual position of the object. It is
used to determine whether a detection is correct based on a predefined threshold. The
threshold is like a benchmark or standard used to decide whether the prediction was good
enough. If the IoU value meets or goes above this threshold, then we say it is a "true
positive," meaning it was a correct prediction. However, if it does not meet this benchmark,
then it's seen as a "false positive," a wrong prediction. The choice of threshold depends on
the specific task and can vary based on the model's accuracy expectations. In object
detection tasks, the IoU threshold also determines the precision, i.e., how many correct
positive predictions (true positives) were obtained out of all positive predictions made
(true positives plus false positives). By changing the IoU threshold, researchers can
control how strict the evaluation is: a higher threshold is stricter about localization
accuracy and yields lower precision, while a lower threshold is more generous in
accepting predictions and yields higher precision. This allows the detection evaluation
to be made as strict or as lenient as required.
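The IoU computation described above can be sketched as a short Python function (an illustrative implementation, not the thesis's code), here using axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection counts as a true positive only if IoU meets the threshold:
pred, truth = (10, 10, 50, 50), (20, 20, 60, 60)
print(iou(pred, truth))          # ~0.39
print(iou(pred, truth) >= 0.5)   # False: rejected at a 0.5 threshold
```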
Figure 5. Intersection over Union (Anwar, 2022)
● To compute the mAP, one considers several factors, including the confusion matrix — a
table that provides insight into the algorithm's performance — as well as IoU, which gauges
the algorithm's accuracy in object detection. Additionally, recall (the proportion of actual
positives correctly identified) and precision (the proportion of correct positive
identifications out of all positive identifications) are considered.
The mAP score not only examines the balance between precision and recall, seeking an
optimal equilibrium, but also addresses errors like false positives (incorrectly identifying
something as an object) and false negatives (failing to detect an actual object).
By accounting for all these facets, the mAP offers a comprehensive evaluation of the
algorithm's performance. This is why it is widely embraced by researchers in the field of
computer vision, serving as a reliable benchmark to gauge the robustness and dependability
of various object detection models.
According to Shah (2022), these are the steps to calculate mAP:
a. Generate prediction scores using the model.
b. Convert the prediction scores to class labels.
c. Calculate the confusion matrix, which includes true positives (TP), false positives (FP),
true negatives (TN), and false negatives (FN).
d. Calculate precision and recall metrics.
e. Calculate the area under the precision-recall curve.
f. Measure the average precision for each class.
g. Calculate the mAP by finding the average precision for each class and then averaging
over the number of classes.
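The per-class part of these steps (a through f) can be sketched as follows. This is a simplified illustration with hypothetical detections already labeled as true or false positives; a real evaluation would first match detections to ground truth via IoU:

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP for one class: rank detections by confidence, accumulate TP/FP,
    and take the area under the precision-recall curve."""
    order = np.argsort(scores)[::-1]            # steps a/b: rank predictions
    tp = np.cumsum(np.array(is_tp)[order])      # step c: running true positives
    fp = np.cumsum(~np.array(is_tp)[order])     # ...and false positives
    prec = tp / (tp + fp)                       # step d: precision at each rank
    rec = tp / n_gt                             # ...and recall
    # step e: make precision monotonically decreasing, then integrate
    prec = np.maximum.accumulate(prec[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], rec))) * prec))

# Step g: mAP is the mean of the per-class AP values.
ap_car = average_precision([0.9, 0.8, 0.6], [True, False, True], n_gt=2)
print(ap_car)  # 0.8333...
```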
Figure 6. mAP flowchart calculation (Hofesmann, 2020)
Figure 7. Steps how to calculate mAP (Anwar, 2022)
● The confusion matrix is a tool used to evaluate the performance of object detection models.
It consists of four attributes: True Positives (TP), True Negatives (TN), False Positives
(FP), and False Negatives (FN). TP represents correct predictions, TN represents correct
non-predictions, FP represents incorrect predictions, and FN represents missed predictions.
These attributes help assess the accuracy and reliability of the model's predictions.
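In object detection, these counts are typically obtained by matching predictions to ground-truth boxes via IoU. Below is a simplified, hypothetical greedy matching scheme (not the exact protocol of any particular benchmark):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def detection_counts(preds, truths, iou_threshold=0.5):
    """Tally TP, FP, FN by greedily matching each prediction to the best
    unmatched ground-truth box whose IoU clears the threshold."""
    matched, tp, fp = set(), 0, 0
    for p in preds:
        best, best_iou = None, 0.0
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) > best_iou:
                best, best_iou = i, iou(p, t)
        if best is not None and best_iou >= iou_threshold:
            matched.add(best)   # correct detection
            tp += 1
        else:
            fp += 1             # detection with no matching ground truth
    fn = len(truths) - len(matched)  # ground truths never matched: missed cars
    return tp, fp, fn

print(detection_counts([(0, 0, 10, 10), (100, 100, 110, 110)],
                       [(1, 1, 11, 11), (50, 50, 60, 60)]))  # (1, 1, 1)
```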
Figure 8. Confusion Matrix (Shah, 2022)
● Average Recall (AR) summarizes the recall of object detection across a range of IoU
thresholds. It is computed as the area under the recall-IoU curve, focusing on recall
rather than precision.
2.6 Relevant Studies, Research Gaps and Future Directions
Numerous research initiatives have been carried out focusing on vehicle detection by applying
deep learning methods. These studies often incorporate aspects such as utilizing aerial images,
processing data in real-time, edge computing, and accounting for varying weather scenarios, while
also aiming to enhance existing algorithms with regards to speed and precision. However, this field
faces hurdles owing to the insufficient exploration of object detection under snowy settings. Based
on an extensive literature review, below are two recent studies that are most closely related to the
investigation that our research is aiming for.
1. Li et al., 2023, aim to develop a deep learning-based target detection algorithm for aerial
vehicle identification. It focuses on the usage of small, low-altitude UAVs for aerial vehicle
detection. It addresses the challenge of precise and real-time vehicle detection in UAVs,
emphasizing that most vehicle targets have few feature points and small sizes, making
detection difficult. The study customizes the YOLOv5 model by introducing additional
prediction heads to detect smaller-scale objects and incorporates a Bidirectional Feature
Pyramid Network. As a prediction frame filtering method, they employ Soft non-maximum
suppression. The experiments are conducted on a self-made dataset to evaluate the
performance of the customized YOLOv5-VTO, which they compare with the baseline
YOLOv5s in terms of mean average precision at different IoU thresholds. The results
show improvements in mAP@0.5 and mAP@0.5:0.95, as well as
accuracy and recall. This study does not include the element of weather conditions and
edge-device processing.
2. Bulut et al., 2023, bring several new contributions to the research on object detection
models for traffic safety applications on edge devices. This study evaluates the
performance of the latest models, including YOLOv5-Nano, YOLOX-Nano, YOLOX-
Tiny, YOLOv6-Nano, YOLOv6-Tiny, and YOLOv7-Tiny on an embedded system-on-
module device that is known for performing edge computing tasks. The study does not
document the parameters of the edge device but mentions the key metrics for performance
comparison such as precision, power consumption, memory usage and inference time. The
study uses a dataset consisting of 11,129 images taken from highway surveillance cameras.
The dataset was shuffled to prevent models from memorizing sequential scenes. Average
precision (AP) was used as the key metric for performance evaluation, which measures
how many boxes are detected correctly out of the total number of detected boxes for a
given object. The models were tested with three consecutive runs for each metric, and the
final value was calculated using the average of the runs. The experimental results showed
that the YOLOv5-Nano and YOLOv6-Nano models were the strongest candidates for real-
time applicability, with the lowest energy consumption and fewer parameters. The
YOLOv6-Tiny model achieved the highest precision values, making it suitable for
applications where accuracy is critical. Again, this study is missing the element of weather
conditions, which our study aims to include.
Despite the undeniable improvements and contributions in the capabilities of deep learning in
object detection, there is still room for further optimization. To achieve this, future research could
explore novel optimization techniques and fine-tune the training process to boost the performance
of models. Additionally, the use of different hardware and software configurations should be
evaluated to assess their impact on model applicability in real-life scenarios. Another area that
requires further investigation is the performance of different models or variations of existing
models. Researchers should aim to identify potential improvements or alternatives that could
enhance the effectiveness of these models in real-life applications. Furthermore, addressing
resource efficiency challenges in object detection models is crucial. This could involve developing
lightweight architectures or optimizing existing models to reduce computational power
requirements while maintaining accuracy and real-time applicability. Finally, there is the clear
effect of weather conditions, especially snow, on the quality of images and, consequently, on how
well detection models work. Current research struggles with a noticeable lack of data on snowy conditions,
showing a strong need to include weather factors in future studies. Adding this information would
not only widen the research area but also be a step towards creating a model that remains strong
and unaffected by the challenges of bad weather.
CHAPTER 3
Methodology
3. Thesis Methodology Approach
3.1 Dataset Acquisition and Description
As this thesis builds upon Mokayed et al., 2023, the data capture and preparation are well
documented in their paper. Here is a summary of the process. The dataset on which all models are
trained is called the Nordic Vehicle Dataset (NVD). The dataset consists of videos/images captured
from a UAV, containing vehicles in random areas and with different snow coverage. It was captured
using a "Smart planes Freya unmanned aircraft system" drone.
The dataset is of a heterogeneous nature, including different ranges of snow coverage, disparities
in resolution, and aerial views from varying altitudes ranging between 120 to 250 meters. The
videos, captured at a speed of 13m/s (47 km/h), are characterized by different features, including
varied resolutions and stabilization statuses; not all are annotated and stabilized yet. The Smart
planes Freya drone facilitated the data collection. Its technical features, which include a wingspan
of 120 cm and a cruise speed of 13m/s, were essential in the acquisition of high-quality data. The
UAV’s camera specifications are equally noteworthy, having a sensor of 1.0-type (13.2 x 8.8 mm)
Eximor RS CMOS and capable of recording videos at 1080p or 4K at 25 frames per second,
lending depth and clarity to the recorded values.
Figure 9. Showcase of dataset images
Despite the diversity in the data, a selection was made that focused on a process to narrow down
the usable data to stabilized and annotated videos that conform to a 1920x1080 resolution and
with snow coverage ranging between 0-10 cm, as described in detail in Table 3. The selection
process was attentively done, leading to the selection of 5 videos specifically for the purposes of
this thesis, out of the available 22, which were further divided into training and testing sets and all
selected models were trained on the same split. This methodological decision was made to ensure
a fair comparison of the models, eliminating any potential influence from differences in the
dataset's quality or inconsistencies in the selection and division of videos. We used the same
subset and split as Mokayed et al. to maintain consistency between our Convolutional Neural
Network (CNN) and their respective models.
A significant part of the data preparation involved the extraction of individual frames from the
chosen videos, each accompanied by a respective file featuring bounding box annotations, curated
through the CVAT tool as referenced in the Hama et al. paper. These annotations are further
transformed to accommodate YOLO model prerequisites. Our Convolutional Neural Network
(CNN) model was trained using heatmaps, which necessitated generating them from the
individual frames and their corresponding annotation files.
Table 1. Specification of Freya unmanned aircraft Mokayed et al.
Table 2. Specification of Freya unmanned aircraft Mokayed et al.
3.2 Data Splitting
In this research, we have gathered all the annotated videos from the NVD. These videos have a
resolution of 1920x1080 and depict various flight altitudes, snow covers, and cloud conditions.
We have split the dataset according to the same methodology used by Mokayed et al., as shown in
the Table 3. This enables us to train and benchmark our model on the same set of images.
The dataset has been split so that the training and testing sets have an equal distribution of snow
coverage and flight altitudes. This allows for a balanced performance evaluation of our model. The
train/test split is around 72:28. However, we have further split the training set by creating a
validation set. This set follows an 80:20 scale of training to validation. Therefore, the actual split
is 57:14:28 (train/validation/test). Videos are also never split between datasets; each video belongs
entirely to either the training or the testing set. This ensures that the model has never previously
seen cars from the testing dataset.
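The split arithmetic works out as follows (values truncated to whole percentages, which matches the 57:14:28 figure):

```python
train, test = 72, 28                  # initial train/test split (in % of frames)
val = int(train * 0.20)               # 20% of the training share -> validation
train_final = int(train * 0.80)       # remaining 80% stays in training
print(f"{train_final}:{val}:{test}")  # 57:14:28
```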
The training set is used to train the model, and the model sees the labels during this process. The
validation set is used after each training epoch. The model does not see its labels directly but
indirectly, as we use it for benchmarking between model iterations during training, and model
hyperparameters are adjusted accordingly. It also helps us avoid overfitting the model. Lastly, the testing
set is used for benchmarking between different models (such as YOLO, Faster R-CNN, and CNN)
and is used after training. The model does not see any labels from the testing set, even indirectly.
Overall, the proposed approach ensures a fair and unbiased evaluation of this research model's
performance.
Table 3. Summary of NVD dataset processing
3.3 Frame and Annotation Processing
The process of frame selection and annotation is pivotal to the detection model's functionality, as
its precision and performance are tied to the quality and appropriateness of its training data
(Liu et al., 2016; Ma et al., 2022).
For model training, it is essential to extract individual annotated images from the annotated videos,
as the network cannot ingest video directly. An annotation file, typically in .xml
format, provides a detailed record of each vehicle across all frames in which it appears. Each
annotation indicates the vehicle's location through the top-left and bottom-right coordinates of its
bounding box, accompanied by its rotation (Russakovsky et al., 2015).
The first step involves segmenting the video into individual frames. Subsequently, an individual
annotation file is generated for every frame derived from the video. Prior to this segmentation,
there was a need to convert the bounding box annotation to be compatible with YOLO and
Detectron-styled format.
For YOLO models, the annotation format is modified to the center coordinate of the bounding
box along with its width and height. Importantly, these coordinates are not absolute values
but are given in proportion to the size of the image. For instance, within an image measuring
1000x1000 pixels, a coordinate that originally reads as [200,200] would be represented as [0.2,0.2] (Redmon
et al., 2016).
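This conversion from absolute corner coordinates to normalized YOLO format can be sketched as follows (an illustrative helper, not the thesis's conversion script):

```python
def to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert absolute corner coordinates (x1, y1, x2, y2) to normalized
    YOLO format: (x_center, y_center, width, height), relative to image size."""
    return ((x1 + x2) / 2 / img_w,   # box center, as a fraction of width
            (y1 + y2) / 2 / img_h,   # box center, as a fraction of height
            (x2 - x1) / img_w,       # box width, normalized
            (y2 - y1) / img_h)       # box height, normalized

# A 100x100 box with its top-left corner at (200, 200) in a 1000x1000 image:
print(to_yolo(200, 200, 300, 300, 1000, 1000))  # (0.25, 0.25, 0.1, 0.1)
```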
On the other hand, the Faster R-CNN model requires absolute coordinates, which we transform
into top-left and bottom-right corners. This may seem identical to the original annotation, but the
original annotation included a rotation parameter and therefore captured the car more precisely.
Consequently, the bounding boxes tend to be larger once the rotation parameter is discarded. Although
this implies a less precise encapsulation of the object, the models mentioned require such a format
(Detectron2.data. transforms — Detectron2 0.6 Documentation, n.d.).
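Dropping the rotation amounts to computing the axis-aligned box that fully encloses the rotated rectangle. A minimal sketch, assuming the annotation provides center, size, and rotation angle (the exact annotation fields may differ):

```python
import math

def enclosing_box(cx, cy, w, h, angle_deg):
    """Axis-aligned box (x1, y1, x2, y2) that fully encloses a rotated
    rectangle given by center, size, and rotation. Discarding the rotation
    this way yields a larger, less precise box, as noted in the text."""
    a = math.radians(angle_deg)
    half_w = (abs(w * math.cos(a)) + abs(h * math.sin(a))) / 2
    half_h = (abs(w * math.sin(a)) + abs(h * math.cos(a))) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# A 40x20 car rotated 90 degrees encloses as a 20x40 axis-aligned box:
print(enclosing_box(100, 100, 40, 20, 90))  # ~(90.0, 80.0, 110.0, 120.0)
```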
Figure 10. Difference between inner bounding boxes with rotation and outer bounding boxes without rotation
When it comes to our CNN model, we have diverged from the traditional approach of using
bounding boxes as labels; instead, we propose the utilization of heatmaps. The heatmaps serve as
single-channel images with increased intensity in areas where the object (vehicle) is present and
lower or 0 intensity in areas where the object is not present (Huang et al., 2022). The generation
of these heatmaps is done through the application of a Gaussian elliptical function. The Gaussian
function has been selected to facilitate a gradual increase in pixel intensity (a proxy for the
likelihood of the presence of a vehicle) as one moves toward the central region of the car. This
feature ensures smoothness for the gradient descent during the training process (Huang et al., 2022).
Considering the rectangular shape of cars, we opted for an elliptical function, which entails setting
one dimension of the Gaussian function with a higher sigma value compared to the other. This
approach allows us to better represent the vehicles' shape in the analysis. Furthermore, we rotated
this elliptical Gaussian function using values derived from the original annotations to ensure an
optimal fit to the vehicle's orientation.
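A rotated elliptical Gaussian of this kind can be generated with numpy as follows. This is an illustrative sketch with hypothetical sigma values, not the thesis's exact code:

```python
import numpy as np

def elliptical_gaussian_heatmap(shape, cx, cy, sigma_x, sigma_y, angle_deg):
    """Single-channel heatmap with a rotated elliptical Gaussian centered on a
    vehicle: intensity peaks at the car's center and falls off smoothly, with
    a larger sigma along the car's long axis."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    a = np.deg2rad(angle_deg)
    # rotate pixel coordinates into the vehicle's frame of reference
    xr = (x - cx) * np.cos(a) + (y - cy) * np.sin(a)
    yr = -(x - cx) * np.sin(a) + (y - cy) * np.cos(a)
    g = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    return (g / g.max()).astype(np.float32)  # normalize to [0, 1]

hm = elliptical_gaussian_heatmap((64, 64), cx=32, cy=32,
                                 sigma_x=8, sigma_y=4, angle_deg=30)
print(hm.shape, hm.max())  # (64, 64) 1.0
```

Per-vehicle heatmaps produced this way can then be combined and renormalized for images containing multiple vehicles.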
Figure 11. Illustration of gaussian elliptical function over heatmap (Probability - How to Get a "Gaussian
Ellipse",2020)
To accommodate images featuring multiple vehicles, we devised a method in which the individual
Gaussian elliptical functions representing each vehicle are normalized to a scale ranging from 0 to
1 and then summed. However, we identified a challenge with this technique: when vehicles in the
image are positioned close to each other, the Gaussian functions overlap significantly, leading to
higher peaks in the heatmap that can overshadow vehicles positioned farther apart. To address this,
we refined our method by reducing the radius of the functions through lower sigma values, which
limits the extent to which individual Gaussian functions influence each other. This modification
confines the problem to just a few samples in the whole dataset.
The resulting heatmap is normalized again to the [0, 1] range with the precision of the float32 data
type. Increasing the precision to float64 would smooth the curve further, but it is not feasible for
training due to the marginal benefit relative to the added training cost and our model structure.
To retain the precision of the coordinates, we store the heatmaps as numpy arrays
rather than as .png images. While we do maintain a repository of heatmap representations in .png
format, these are for illustrative purposes only and are not utilized in the
training process. As a result of this process, we also have a separate annotation file with the
required formats and a heatmap for each image in the dataset.
Figure 12. From left to right: original image, produced heatmaps, overlay of both
3.4 Training Techniques, Models
The training techniques differ for every model type, and thus, the process of each training is
explained separately.
For retraining the YOLO models, the official Ultralytics library was used (Ultralytics YOLOv8 Docs,
n.d.), where the images were fed into the network with their corresponding YOLO-formatted
bounding box annotations. The "s" size models were chosen as they are the largest models that the
edge devices upon which this research is based could handle. Before initiating YOLO model
training, data augmentation on the images was applied, a strategy implemented to expand the
applicability and improve the precision of the model. The augmentation process occurred during
online training, leveraging both the YOLO’s inherent augmentation feature and the
Albumentations library.
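An augmentation such as a horizontal flip must transform the bounding boxes along with the pixels. A minimal numpy sketch of the idea (not the Albumentations implementation), using YOLO-style normalized boxes:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its YOLO-style boxes
    (x_center, y_center, w, h, all normalized to [0, 1])."""
    flipped = image[:, ::-1].copy()       # mirror pixels left-right
    # mirroring maps x_center to 1 - x_center; y, width, height are unchanged
    new_boxes = [(1.0 - xc, yc, w, h) for (xc, yc, w, h) in boxes]
    return flipped, new_boxes

img = np.arange(12).reshape(3, 4)
out, boxes = hflip_with_boxes(img, [(0.25, 0.5, 0.1, 0.2)])
print(boxes)  # [(0.75, 0.5, 0.1, 0.2)]
```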
Figure 13. Example of Yolo model sizes (Train Custom Data, 2023)
To facilitate the retraining of Detectron models, the well-known Detectron2 library was utilized,
enabling the integration of images and corresponding Detectron-formatted bounding box
annotations into the system (Sethi, n.d.). The official model weights of "faster_rcnn_R_50_FPN_3x"
from Detectron2 were used as a baseline for retraining on the NVD dataset. The model training and
hyperparameter tuning were undertaken in collaboration with the Mokayed NVD team to compare
the performance and add to their research with our work on edge devices. For the YOLO models,
image augmentation techniques were also applied according to Mokayed et al., 2023 and the
codebase specifications, using the Albumentations library. In terms of training our proprietary CNN model, there
was no pre-defined framework or documentation to guide the process. Primarily, the PyTorch
library was utilized to facilitate the development of a customized training methodology.
First, a CNN architecture capable of tackling this problem was devised:
Figure 14. CNN network architecture made as pytorch nn.Module.
The architecture of the network follows a so-called encoder-decoder framework. Initially,
convolutional layers functioning as encoders decrease spatial dimensions and facilitate feature
extraction. Subsequently, upsampling layers, also known as transpose layers, assume the role of
decoders, which increase the spatial dimensions and generate output.
The convolution operation is first applied, followed by ReLU activation on the input data. After
each convolution operation, a max pooling layer is performed to reduce the spatial dimensions of
the feature maps by a factor of 2. This cycle is repeated four times. The ReLU activation function
is utilized due to its ability to mitigate the vanishing gradient problem and accelerate training.
In the upsampling stage, a series of transpose convolutions is used. These differ slightly from a
classical upsampling layer, which performs a series of simple interpolations: a transpose
convolution also has trainable parameters, which can improve performance. These layers reduce
the channels and increase the spatial dimensions, ultimately generating our predicted heatmap.
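The encoder-decoder structure described above can be sketched as a PyTorch nn.Module. This is a minimal illustration of the pattern (four conv/ReLU/max-pool stages, then four transpose convolutions); the channel widths are hypothetical, not the thesis's exact values:

```python
import torch
import torch.nn as nn

class HeatmapNet(nn.Module):
    """Minimal encoder-decoder sketch: four conv/ReLU/max-pool stages halve
    the spatial size (encoder), then four transpose convolutions restore it
    while reducing channels down to a single-channel heatmap (decoder)."""
    def __init__(self):
        super().__init__()
        enc, chans = [], [3, 16, 32, 64, 128]
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(2)]            # halves H and W each stage
        self.encoder = nn.Sequential(*enc)
        dec, rev = [], [128, 64, 32, 16, 1]
        for c_in, c_out in zip(rev[:-1], rev[1:]):
            # trainable upsampling: doubles H and W each stage
            dec += [nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = HeatmapNet()
out = net(torch.zeros(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```

With input spatial size 64, the encoder reduces it to 4 and the decoder restores it to 64, matching the heatmap resolution to the input image.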
Figure 15. Our CNN network architecture