This poster presents the use of a convolutional neural network and a virtual environment to detect stop signs and estimate the distance to them from individual images. To train the network, we develop a method to automatically collect labeled data from Grand Theft Auto 5, a video game. Using this method, we collect a dataset of 1.4 million images with and without stop signs across different environments, weather conditions, and times of day. A convolutional neural network trained and tested on this data detects 95.5% of the stop signs within 20 meters of the vehicle, with a false positive rate of 5.6% and an average distance error of 1.2 m to 2.4 m on video game data. We also find that the effective range of our approach is limited to about 20 m. The applicability of these results to real-world driving appears promising but must be studied further.
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand Theft Auto 5 (Poster)
Artur Filipowicz, Jeremiah Liu, Alain Kornhauser
Operations Research and Financial Engineering
Princeton University
The Problem
Detect the presence of a stop sign and determine the distance to it based on an
image.
Related Works
• A method employing both single-image and multi-view analysis achieves a 97% traffic sign
classification rate (1).
• A neural network trained to perform traffic sign classification on images obtains an
accuracy of 95% (2).
• (1) uses multiple views to locate 95% of signs within 3 m of their true locations at 2
images per second.
• Algorithms in (3) can determine traffic sign position with error between 0.2 m and
1.6 m within the range of 7 m to 25 m from the stop sign.
The Proposed Solution
• Generate a dataset of synthetic, automatically labeled driving scenes using the
video game Grand Theft Auto 5.
• Train a convolutional neural network (CNN) with an architecture from (4) which
runs at 10 images per second.
Image: X ∈ ℝ^(h×w×c)
Output: Y = [𝟙Stop, dStop], where 𝟙Stop indicates the presence of a stop sign and dStop is the distance from the camera to it
Network: f(X) → Y
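As a minimal illustration of this formulation, the network output pairs a stop-sign presence indicator with a distance estimate, and could be decoded as in the sketch below. The function name and the 0.5 decision threshold are assumptions for illustration, not part of the poster's method.

```python
def decode_output(y, presence_threshold=0.5):
    """Decode a network output Y = [indicator, distance].

    Returns (sign_present, distance_in_meters); the distance is only
    meaningful when a stop sign is judged to be present.
    """
    indicator, distance = y
    # Hypothetical decision threshold on the presence indicator.
    present = indicator >= presence_threshold
    return present, (distance if present else None)

# Example: a confident detection 12.3 m ahead of the vehicle.
print(decode_output([0.9, 12.3]))  # (True, 12.3)
```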
Grand Theft Auto 5
• Video game with a rich road environment
• Vehicles, pedestrians and animals
• Road network of bridges, tunnels, freeways, and intersections
• Urban, suburban, rural, desert and woodland environments
• 14 weather conditions
• Lighting conditions for 24 hours of the day
Figure 1: Diagram of the problem.
Figure 2: Driving scene in different weather and lighting
conditions generated from Grand Theft Auto 5.
Figure 3: A stop sign 10 meters away in Grand Theft Auto
5 (left) and the real world (right).
• Using Grand Theft Auto 5, we generated a dataset of 1.4 million images with
and without stop signs under different lighting and weather conditions and in different locations.
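One way such automatic labeling can work is sketched below, under the assumption that the game exposes world coordinates of the camera and of nearby stop signs; the function name, the 40 m labeling range, and the coordinate interface are all hypothetical. A real pipeline would also need to verify that the sign is actually visible (unoccluded and in the camera's field of view).

```python
import math

def label_frame(camera_pos, sign_positions, max_range=40.0):
    """Produce a training label (indicator, distance) for one frame.

    camera_pos: (x, y, z) of the in-game camera.
    sign_positions: world coordinates of stop signs near the vehicle.
    Returns (1, distance) if the nearest stop sign lies within
    max_range meters of the camera, otherwise (0, None).
    """
    if not sign_positions:
        return 0, None
    # Euclidean distance from the camera to the nearest sign.
    nearest = min(math.dist(camera_pos, p) for p in sign_positions)
    if nearest <= max_range:
        return 1, nearest
    return 0, None
```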
Results
• The convolutional neural network is accurate on game images and shows
promising, though less accurate, performance on real-world images.
Table 1: Performance on images from Grand Theft Auto 5
| Range | Accuracy | False Negative Rate | False Positive Rate | Mean AE (m) | Median AE (m) | Mean AE (m) when correct | Median AE (m) when correct |
|-----------|-------|-------|-------|-----|-----|-----|-----|
| 0 m–10 m | 0.961 | 0.039 | n/a | 2.2 | 0.9 | 1.2 | 0.8 |
| 10 m–20 m | 0.949 | 0.051 | n/a | 3.3 | 1.7 | 2.4 | 1.6 |
| 20 m–30 m | 0.798 | 0.202 | n/a | 4.7 | 3.4 | 3.1 | 2.7 |
| 30 m–40 m | 0.440 | 0.560 | n/a | 3.4 | 2.6 | 3.1 | 2.1 |
| > 40 m | 0.944 | n/a | 0.056 | 1.8 | 0.2 | 0.9 | 0.2 |

n/a = not applicable; AE = absolute error of the distance estimate
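For reference, the accuracy and false negative rate in a range bin can be computed from per-image detection outcomes as in this sketch (illustrative only, not the authors' evaluation code):

```python
def detection_rates(predictions, ground_truth):
    """Compute (accuracy, false_negative_rate) for one range bin.

    predictions, ground_truth: parallel lists of booleans, one per
    image, where True means a stop sign is detected / present.
    """
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    # A false negative is a present sign that was not detected.
    misses = sum((not p) and g for p, g in zip(predictions, ground_truth))
    positives = sum(ground_truth)
    accuracy = correct / len(predictions)
    fnr = misses / positives if positives else None
    return accuracy, fnr
```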
Table 2: Performance on 200 real world images.
| Range | Accuracy | False Negative Rate | False Positive Rate | Mean AE (m) | Median AE (m) | Mean AE (m) when correct | Median AE (m) when correct |
|-----------|-------|-------|-------|------|------|------|------|
| 0 m–10 m | 1.000 | 0.000 | n/a | 9.1 | 8.9 | 9.1 | 8.9 |
| 10 m–20 m | 0.750 | 0.250 | n/a | 16.4 | 16.3 | 15.1 | 15.7 |
| 20 m–30 m | 0.968 | 0.032 | n/a | 10.0 | 10.9 | 9.8 | 10.8 |
| 30 m–40 m | 0.687 | 0.313 | n/a | 5.9 | 4.2 | 7.5 | 6.1 |
| > 40 m | 1.000 | n/a | 0.000 | 0.9 | 0.4 | 0.9 | 0.4 |

n/a = not applicable; AE = absolute error of the distance estimate
Conclusions
• Virtual environment enables creative data collection methods
• CNN can detect 95.5% of the stop signs within 20 meters with an average error
in distance of 1.2 m to 2.4 m on video game data
• Need for research on real world adaptation
• Need for research on optimal use of simulators
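The 95.5% figure for signs within 20 m is consistent with pooling the detection rates of the two nearest range bins of Table 1, assuming roughly equal numbers of images per bin (an assumption made here purely for illustration):

```python
# Detection rates for the 0-10 m and 10-20 m bins from Table 1.
rate_0_10, rate_10_20 = 0.961, 0.949

# With equally sized bins, the pooled detection rate is the mean.
pooled = (rate_0_10 + rate_10_20) / 2
print(round(pooled, 3))  # 0.955
```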
References
• [1] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3d
localisation. Machine Vision and Applications, 25(3):633–647, 2014.
• [2] Arturo De La Escalera, Luis E Moreno, Miguel Angel Salichs, and José María Armingol. Road traffic sign
detection and classification. IEEE Transactions on Industrial Electronics, 44(6):848–859, 1997.
• [3] André Welzel, Andreas Auerswald, and Gerd Wanielik. Accurate camera-based traffic sign localization. In 17th
International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 445–450. IEEE, 2014.
• [4] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct
perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision,
pages 2722–2730, 2015.
Acknowledgment:
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.