1. We followed the default network settings [1], using S = 7 and B = 2 to predict 20 classes.
Convolutional Neural Networks for Real-time Road Sign Detection
Guangrui Liu, Maryam Rahnemoonfar
College of Science & Engineering, Texas AM University – Corpus Christi
Abstract
Optical vision is an essential ability of future autonomous cars. Accurate detection of objects
such as vehicles, buildings, pedestrians and road signs has been a challenging task for decades,
because real-world images vary in illumination, rotation, scale and occlusion. In recent years,
many classification-after-localization methods based on Convolutional Neural Networks (CNNs)
have shown outstanding detection precision in various conditions, as long as the training image
dataset includes enough multi-circumstance samples. However, the slow recognition speed of
these two-stage methods limits their use in real-time situations. In this paper, we implement a
novel one-stage CNN structure to perform a real-time road sign detection task. The model finds
a road sign's position and category at the same time [1]. Finally, we evaluate its accuracy on
images and its speed on videos.
Network Architecture
Figure 1. The model
Method
The novel CNN model [1], YOLO, treats the detection problem as a regression problem.
It divides the input image into an S x S grid.
Each grid cell predicts:
• B bounding boxes (width, height, x, y)
• B confidence scores P(Object)
- the probability that the box contains an object
• C conditional class probabilities P(Classi | Object)
- conditioned on the grid cell containing an object
The network thus regresses from the image to an S x S x ( B x 5 + C ) tensor.
At test time, it computes a class-specific confidence score for each box by
P(Classi) = P(Classi | Object) * P(Object)
For a given threshold Pmin, it outputs a detection when P(Classi) >= Pmin.
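The decoding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tensor layout (B boxes of 5 values per cell, followed by C class probabilities) and the threshold value are assumptions for the sake of the example.

```python
import numpy as np

# Assumed settings from the paper's defaults: S = 7 grid, B = 2 boxes, C = 20 classes.
S, B, C = 7, 2, 20
P_MIN = 0.2  # hypothetical detection threshold

def decode(output, p_min=P_MIN):
    """Turn an S x S x (B*5 + C) network output into thresholded detections.

    Assumed cell layout: [x, y, w, h, P(Object)] for each of the B boxes,
    then C conditional class probabilities P(Class_i | Object).
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_cond = cell[B * 5:]  # P(Class_i | Object), length C
            for b in range(B):
                x, y, w, h, p_obj = cell[b * 5 : b * 5 + 5]
                # Class-specific confidence: P(Class_i) = P(Class_i | Object) * P(Object)
                scores = class_cond * p_obj
                best = int(np.argmax(scores))
                if scores[best] >= p_min:
                    detections.append((row, col, b, best, float(scores[best]), (x, y, w, h)))
    return detections
```

Each surviving tuple pairs a grid cell and box index with its best class and score; a real pipeline would follow this with non-maximum suppression.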
Training and result
• The training dataset contains 484 stop sign images and 284 yield sign images
• Based on an extraction model pre-trained on ImageNet
• 5000 batches, 128 images per batch
• The training process lasted 7 hours on a single Nvidia GTX 980 Ti video card
(1) The trained model was then tested on 40 images: 20 stop signs and 20 yield signs.
(2) We also tested the model on videos, where it predicted results at more than 24 frames per second.
Pros:
The YOLO model is an efficient object detector in terms of both accuracy and speed.
The model can detect any object class if enough training data is provided.
The fast detection model has huge potential in real-time applications, such as autonomous
driving, home security systems or live video surveillance.
Furthermore, this model brings a new idea: the detection problem can be solved as a
regression problem rather than a classification problem.
Cons:
× Hard to detect small, grouped objects, since each grid cell predicts only two boxes and one class
× It struggles to detect objects with new or unusual aspect ratios
Conclusion
Overall accuracy   Stop sign accuracy   Yield sign accuracy
92.5%              90%                  95%
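The overall figure is consistent with the 40-image test split reported above (20 images per class); a quick arithmetic check:

```python
# Sanity check: per-class accuracies vs. the overall 92.5% on the 40-image test set.
stop_correct = round(0.90 * 20)    # 18 of 20 stop signs detected
yield_correct = round(0.95 * 20)   # 19 of 20 yield signs detected
overall = (stop_correct + yield_correct) / 40
print(overall)  # 0.925
```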
Figure 2 Network Architecture
Figure 3. Testing results. The left 5 columns show correct detections; the rightmost column shows failed results.
Reference
1. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).