Driving School II
Video Games for Autonomous Driving
Independent Work
Artur Filipowicz
ORFE Class of 2017
Advisor Professor Alain Kornhauser
arturf@princeton.edu
May 3, 2016
Revised
August 27, 2016
Abstract
We present a method for generating datasets to train neural networks and other statistical
models to drive vehicles. In [8], Chen et al. used a racing simulator called Torcs to
generate a dataset of driving scenes which they then used to train a neural network. One
limitation of Torcs is a lack of realism. The graphics are plain and the only roadways
are racetracks, which means there are no intersections, pedestrian crossings, etc. In this
paper we employ a game called Grand Theft Auto 5 (GTA 5). This game features realistic
graphics and a complex transportation system of roads, highways, ramps, intersections,
traffic, pedestrians, railroad crossings, and tunnels. Unlike Torcs, GTA 5 has more car
models, urban, suburban, and rural environments, and control over weather and time.
With the control of time and weather, GTA 5 has an edge over conventional methods of
collecting datasets as well.
We present methods for extracting three particular features. We create a function for
generating bounding boxes around cars, pedestrians and traffic signs. We also present
a method for generating pixel maps for objects in GTA 5. Lastly, we develop a way to
compute distances to lane markings and other indicators from [8].
Acknowledgments
I would like to thank Professor Alain L. Kornhauser for his
mentorship during this project and Daniel Stanley and Bill
Zhang for their help over the summer and last semester.
This paper represents my own work in accordance with University
regulations.
Artur Filipowicz
Contents

1 From The Driving Task to Machine Learning
1.1 The Driving Task
1.2 The World Model
1.3 Computer Vision
1.4 Machine Learning
2 Datasets for the Driving Task
2.1 Cars, Pedestrians, and Cyclists
2.2 Lanes
2.3 Observations on Current Datasets
2.4 Video Games and Datasets
3 Sampling from GTA 5
3.1 GTA 5 Scripts Development
3.2 Test Car
3.3 Desired Functions
3.4 Screenshots
3.5 Bounding Boxes
3.5.1 GTA 5 Camera Model
3.5.2 From 3D to 2D
3.5.3 General Approach To Annotation of Objects
3.5.4 Cars
3.5.5 Pedestrians
3.5.6 Signs
3.6 Pixel Maps
3.7 Road Lanes
3.7.1 Notes on Drivers
3.7.2 Indicators
3.7.3 Road Network in GTA 5
3.7.4 Finding the Lanes
4 Towards The Ultimate AI Machine
4.1 Future Research Goals
A Screenshot Function

List of Figures

1 Graphics and roads in Torcs.
2 Graphics and roads in GTA 5.
3 Test Vehicle
4 The red dot represents camera location.
5 Camera model and parameters in GTA 5
6 Two cars bounded in boxes. Weather: rain.
7 Two cars bounded in boxes.
8 Traffic jam bounded in boxes.
9 Pedestrians bounded in boxes.
10 Some of the traffic signs present in GTA 5.
11 Stop sign in bounding box.
12 Traffic lights in bounding boxes.
13 Image with a bounding box.
14 Image with a pixel map for a car applied.
15 List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
16 Flags for links. [2]
17 Flags for nodes. [2]
18 Blue line represents where we want to collect data on lane location.
19 Red markers represent locations of vehicle nodes.
20 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.
21 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.
22 Node database entry design.
23 GTA V Experimental Setup
1 From The Driving Task to Machine Learning
1.1 The Driving Task
The driving task is a physics problem of moving an object from point a ∈ R⁴ to point b ∈ R⁴, with time being the fourth dimension, without colliding with any other object. There
are also additional constraints in the form of lane markings, speed limits, and traffic flow
directions. Even with all constraints beyond avoiding collisions, the physical problem of
finding a navigable path is easy given a model of the world. That is, if the location of all
objects and their shapes is known with certainty and the location of the constraints is
known, then the task becomes first the computation of a path in a digraph G representing
the road network and then for each edge finding unoccupied space and moving the object
into it. All of these problems can be solved using fundamental physics and computer
science. What makes the driving task difficult in the real world setting is the lack of an
accurate world model. In reality we do not have omniscient drivers.
1.2 The World Model
People drive, and so do computers to a limited extent. Therefore, omniscience is not
necessary. Some subset of the total world model is good enough to perform the driving
task. Perhaps with limited knowledge it is only possible to successfully complete the task with a probability less than 1, but the success rate is high enough for people to rely on this form of transport.
To drive, we still need a world model. This model is constructed by means of sensor fusion, the combination of information from several different sensors. In 2005, Princeton
University’s entry in the DARPA Challenge, Prospect 11, used radar and cameras to
identify and locate obstacles. Based on these measurements and GPS data, the on-board
computer would create a world model and find a safe path. [4] In a similar approach, the
Google Car uses radar and lidar to map the world around it. [14]
Approaches in [4], [14], and [29] appear rather cumbersome and convoluted compared to the human way of creating a world model. Humans have five sensory organs: the eyes, the nose, the ears, the mouth, and the skin. In driving, neither taste nor smell nor touch is used to build the world model, as these senses are mostly cut off from the world outside the vehicle. The driver can hear noises from the outside; however, they can be muffled by the sound of the driver's own vehicle, and many important objects, such as street signs and lane markings, do not make noise. To construct the world model, humans predominantly use one sensor, the eyes. We can suspect that there is enough information encoded in the visible light coming through the front windshield to build a world model good enough for completing the driving task. However, research on autonomous vehicles - the construction of a solution to the driving task using artificial intelligence - stays away from approaching the problem in a purely visual way, as noted in [4] and [29]. The reason for this is that vision, computer vision in particular, is difficult.
1.3 Computer Vision
Let X ∈ R^(h×w×c) be an image of width w, height h, and c color channels. As we stated earlier, X has enough information for a human to figure out where lane markings and other vehicles are, identify and classify road signs, and perform other measurements to build a world model. Perhaps several images in a sequence are necessary, although [8] shows that a lot of information can be extracted from a single image. The difficulty of computer vision is that X is a matrix of numbers representing the colors of pixels. In this representation an object can appear very different depending on lighting conditions. Additionally, due to perspective, the same object can appear at different sizes and therefore occupy a different number of pixels. These are two of the many variations which humans can account for, but for which naive machine approaches fail.
Computer vision is difficult but not impossible. In recent decades, researchers have used machine learning to enable computers to take X and construct more salient representations.
1.4 Machine Learning
The learning task is as follows: given some image Xi, we wish to predict a vector of indicators Yi. Yi could contain distances to lane markings and vehicles, locations of street signs, etc., and can then be used to construct a world model. To that end, we want to train a function f such that Yi = f(Xi). We say that (Xi, Yi) ∼ PX,Y.
The machine learning approach to this problem mimics humans in more than just the focus on visual information. The primary method of learning from images is the use of neural networks, more specifically convolutional neural networks. These statistical models are inspired by the neurons which make up the nerves and brain areas responsible for vision.
The mathematical abstraction is represented as follows. Let f(xi, W) be a neural network with L hidden layers of sizes l1 to lL:

f(xi, W) = gL(W(L) ... g3(W(3) g2(W(2) g1(W(1) xi))) ...)
W = {W(1), W(2), ..., W(L)}
W(i) ∈ R^(l(i+1) × l(i))

where gi(x) is some activation function.
The process of training is the process of adjusting the values of the W(i). This first requires a loss function which expresses the error made by the network; a common loss function is L2. Let D be a dataset of n indicator Yi and image Xi pairs,

D = {(Xi, Yi)}, i = 1, ..., n,

and let R ⊂ D be the training set and T ⊂ D be the test set, with

R ∩ T = Ø, R ∪ T = D, |R| = r, |T| = t.

We wish to create a neural network model f that attains min_f L2(T, f).
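As a concrete example (the paper leaves the exact form of L2 open, so this is only one common choice), the L2 loss over the test set can be written as a mean squared error:

\[ L_2(\mathcal{T}, f) = \frac{1}{t} \sum_{(X_i, Y_i) \in \mathcal{T}} \left\lVert Y_i - f(X_i, W) \right\rVert_2^2 \]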
To minimize the loss function with respect to W, the most common method is the Back-Propagation Algorithm [27], which uses stochastic gradient descent to find a local minimum of a function. At each iteration j (of J total iterations), the Back-Propagation Algorithm updates W:

W_(j+1) = W_j − η ∂E(W)/∂W_j

The two sources of randomness in the algorithm are the initialization W0 and the order π in which training examples are used. The initial values of the elements of the matrices in W0 are uniform random variables. The ordering of the examples is often a random sample, with replacement, of J pairs (xi, yi) ∈ R.
On an intuitive level, the network adjusts W to extract useful features from the image pixel values Xi. In the process it models the distribution (Xi, Yi) ∼ PX,Y. In theory, the larger W is, the more capacity the network has for extracting and learning features and representing complex distributions. At the same time, it is also more likely to fit noise in the data and non-salient features such as clouds. This overfitting causes poor generalization, and we need a network which can generalize to many driving scenes. There are several regularization techniques to overcome overfitting, including L1, L2, dropout, and others. However, these will only be effective if the data adequately represent the domain of PX,Y. This domain for driving scenes is huge, considering it includes images of all the different kinds of roads, vehicles, pedestrians, street signs, traffic lights, intersections, ramps, lane markings, lighting conditions, weather conditions, times of day, and positions of the camera. [8] tested a network in a limited subset of these conditions and used almost half a million images for training.
2 Datasets for the Driving Task
Machine learning for autonomous vehicles has been studied for years. Therefore, several
datasets already exist. These datasets come in two types. There are datasets of objects
of interest in the driving scenes which include vehicles (cars, vans, trucks), cyclists,
pedestrians, traffic lights, lane markings and street signs. Usually, these datasets provide
coordinates of bounding boxes around the objects Yi. These are useful for training
localization and classification models. The second type of dataset provides distances to lane markings, cars, and pedestrians. These are used to train regression models. Here we
will give a brief overview of several of these datasets.
2.1 Cars, Pedestrians, and Cyclists
The Daimler Pedestrian Segmentation Benchmark Dataset contains 785 images of pedestrians in an urban environment captured by a calibrated stereo camera. The groundtruth consists of the true pixel shape and a disparity map. [11]
The CBCL StreetScenes Challenge Framework contains 3,547 images of driving scenes with bounding boxes for 5,799 cars, 1,449 pedestrians, and 209 cyclists, as well as buildings, roads, sidewalks, stores, trees, and the sky. The images were captured by photographers from street, crosswalk, and sidewalk views. [7]
KITTI Object Detection Evaluation 2012 contains 7481 training images and 7518 test
images with each image containing several objects. The total number of objects is 80,256,
including cars, pedestrians, and cyclists. The groundtruth includes a bounding box for
the object as well as an estimate of the orientation in the bird’s eye view. [13]
Caltech Pedestrian Detection Benchmark contains 10 hours of driving in an urban
environment. The groundtruth contains 350,000 bounding boxes for 2300 unique pedes-
trians. [9]
There are several datasets for street signs [23], [21], and [15]. However, these datasets
have been made in European countries and therefore they contain European signs which
are very different from their US counterparts. Luckily [24] is a dataset of 6,610 images
containing 47 different US road signs. For each sign the annotation includes sign type,
position, size, occluded (yes/no), and on side road (yes/no).
2.2 Lanes
KITTI Road/Lane Detection Evaluation 2013 has 289 training and 290 test images of road lanes, with groundtruth consisting of pixel maps of the road area and of the lane the vehicle is in. The dataset contains images from three environments: urban with unmarked lanes, urban with marked lanes, and urban with multiple marked lanes. [12]
The ROMA lane database has 116 images of different roads, with groundtruth pixel positions of visible lane markings. The camera calibration specifies the pixel distance to the true horizon and conversions between pixel distances and meters. [30]
2.3 Observations on Current Datasets
The above datasets are quite limited. First, most of them are small when compared to
the half a million images used in [8]. Second, they do not represent many of the driving
conditions, such as different weather conditions or times of day. The reason for this is that measuring equipment, especially cameras, functions well only in certain conditions; since this tends to mean sunny weather, most of these datasets are collected during such times. Additionally, all of these datasets involve some amount of manual labeling, which is not feasible when a dataset includes millions of images.
2.4 Video Games and Datasets
The problems associated with these datasets would be resolved if we could somehow sample both Xi and Yi from PX,Y without having to spend time measuring Yi. This is
not possible in the real world. However, [8] decided to use a virtual world, a racing video
game called Torcs [6]. The hope behind this approach is that the game can simulate
PX,Y well enough so that the network, once trained, will be able to generalize to the real
world. Let us assume that this is true.
The main benefit of using Torcs and other video games is access to the game engine.
This allows us to extract the true Yi for each Xi we harvest from the screen. Torcs itself
has several restrictions which limit it from simulating the range of driving conditions
present in the real world. Fundamentally it is a racing game with circular, one-way
tracks. The weather and lighting conditions are fixed. The textures are rather simple
and thus unrealistic.
To overcome these limitations and allow for a more diverse and realistic dataset, we
focus on the game Grand Theft Auto 5 (GTA 5). Unlike Torcs, the makers of GTA 5 had the funds to create a very realistic world, since they were developing a commercial product and not an open-source research tool. GTA 5 has hundreds of different vehicles, pedestrians, freeways, intersections, traffic signs, traffic lights, rich textures, and many other elements which create a realistic environment. Additionally, GTA 5 has about 14 weather conditions and simulates lighting conditions for all 24 hours of the day. To tap into
these features, the next section examines ways of extracting various data.
Figure 1: Graphics and roads in
Torcs.
Figure 2: Graphics and roads in
GTA 5.
3 Sampling from GTA 5
3.1 GTA 5 Scripts Development
GTA 5 is a closed source game. There is no out-of-the-box access to the underlying
game engine. However, due to the game’s popularity, fans have hacked into it and
developed a library of functions for interacting with the game engine. This is done through scripts loaded into the game. The objective of this paper is not to give a tutorial on coding scripts for GTA 5, and as such we will keep the discussion of code
to a minimum. However, we will explain some of the code and game dynamics for the
purpose of reproducibility and presentation of the methods used to extract data.
Two tools are needed to write scripts for GTA 5. The first tool is Script Hook V by Alexander Blade. This tool can be downloaded from https://www.gta5-mods.com/tools/script-hook-v or http://www.dev-c.com/gtav/scripthookv/. It comes with a useful trainer
which provides basic control over many game variables including weather and time. The
next tool is a library called Script Hook V .Net by Patrick Mours which allows us to use
C# and other .Net languages to write scripts for GTA 5. The library can be downloaded
from https://www.gta5-mods.com/tools/scripthookv-net. For full source code and list
of functions please see https://github.com/crosire/scripthookvdotnet.
3.2 Test Car
To make the data collection more realistic we will use an in-game vehicle, the test car,
with a mounted camera; similar to [13]. The vehicle model for the test car was picked
arbitrarily and can be replaced with any other model. Besides the steering controls, we introduce three new functions bound to the following keys: NumPad0, "I", and "O". NumPad0 spawns a new instance of our test car. "I" mounts the rendering camera on the test car.
Figure 3: Test Vehicle
"O" restores control of the rendering camera to the original state. Let us look at some of the code for the test car.
The TestVehicle() function is a constructor for the TestVehicle class. It is called once
when all of the scripts are loaded. This occurs at the start of the game and can be
triggered at any point in the game by hitting the ”insert” key. This constructor gains
control of the camera which is rendering the game by destroying all cameras and creating
a new rendering camera. The function responsible for this is World.CreateCamera. The
first two arguments represent position and rotation. The last argument is the field of
view in degrees. We set it to 50; however, this could be changed to match the parameters of a real-world camera.
It is important to note GTA.Native.Function.Call. GTA 5’s game engine has thousands
of native functions which were used by the developers to build the game. This library
encapsulates some of them. Others can be called using GTA.Native.Function.Call where
the first argument is the hash code of the function to call and the remaining arguments
are the arguments to pass to the native function. One of the biggest challenges in this
project is figuring out what these other arguments represent and control. There are
online databases where players of the game list known functions and parameters. These
databases are far from complete. Therefore, for some of these native function calls, some
of the arguments may not have any justification besides that they make the function
work. This is the price paid for using a closed source game.
public TestVehicle()
{
UI.Notify("Loaded TestVehicle.cs");
// create a new camera
World.DestroyAllCameras();
camera = World.CreateCamera(new Vector3(), new Vector3(), 50);
camera.IsActive = true;
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true,
camera.Handle, true, true);
// attach time methods
Tick += OnTick;
KeyUp += onKeyUp;
}
The camera position and rotation do not matter in the previous function as they will
be dynamically updated to keep up with the position and rotation of the car. This is accomplished by updating both properties on every tick of the game. A tick is a
periodic call of the OnTick function. On each tick, we will keep the camera following the
car by setting its rotation and position to be that of the test car. The position of the
camera is offset by 2 meters forward and 0.4 meters up relative to the center of the test
car. This places the camera on the center of the hood of the car as seen in Figure 4.
// Function used to keep camera on vehicle and facing forward on each tick step.
public void keepCameraOnVehicle()
{
if (Game.Player.Character.IsInVehicle())
{
// keep the camera in the same position relative to the car
camera.AttachTo(Game.Player.Character.CurrentVehicle,
new Vector3(0f, 2f, 0.4f));
// rotate the camera to face the same direction as the car
camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation;
}
}
Figure 4: The red dot represents camera location.
void OnTick(object sender, EventArgs e)
{
keepCameraOnVehicle();
}
3.3 Desired Functions
Being inside the game with our test vehicle, we want to collect training data. Existing
datasets provide good inspiration for what should be collected. A common datum is the
coordinates of bounding boxes for objects such as cars as in [7], [9] and [13] and traffic
signs as in [23], [21], [15] and [24]. Pixel maps marking the areas in the image where certain objects appear are also common. ROMA [30] has pixels of lane markings marked. KITTI Road/Lane Detection Evaluation 2013 [12] has pixels of road areas marked. The Daimler Pedestrian Segmentation Benchmark Dataset [11] has pixels of pedestrians marked. Lastly, we would like to make measurements of distances to lanes and cars in the framework from [8]. The
following sections describe ways of collecting the above information for X, Y data pairs.
3.4 Screenshots
To collect X, we take a screenshot of the game. GTA 5 runs only on Windows. Using the Windows user32.dll functions GetForegroundWindow, GetClientRect, and ClientToScreen, we can extract the exact area of the screen where the game appears. Since neural networks take small images as input, usually around 100 pixels by 200 pixels, we set the game resolution to be as small as possible and let h = IMAGE_HEIGHT = 600 pixels and w = IMAGE_WIDTH = 800 pixels. These could be further scaled down to fit a particular model such as [8]. For the implementation, please see Appendix A.
3.5 Bounding Boxes
A bounding box is a pair of points which defines a rectangle which encompasses an
object in an image. Let b = {(xmin, ymin), (xmax, ymax)} be a bounding box, where
xmin, ymin, xmax, ymax are coordinates in an image in pixels with the upper left corner
being the origin. The task of creating bounding boxes includes computing the extremes
of a 3 dimensional object and enclosing them in a rectangle. The algorithm for doing
this is very simple.
Algorithm 1 Algorithm for computing a bounding box.
Require: Model m and center c of an object
1: get the dimensions of m → (h, w, d)
2: compute unit vectors with respect to the object (ex, ey, ez)
3: using ex, ey, ez and h, w, d compute the set of vertices v of a cube enclosing the object
4: map each point p ∈ v to the viewing plane using g : R³ → R³ to create the set z
5: xmin = minimum x-coordinate over z
6: xmax = maximum x-coordinate over z
7: ymin = minimum y-coordinate over z
8: ymax = maximum y-coordinate over z
9: if xmin < 0 then xmin = 0
10: if xmax > IMAGE_WIDTH then xmax = IMAGE_WIDTH
11: if ymin < 0 then ymin = 0
12: if ymax > IMAGE_HEIGHT then ymax = IMAGE_HEIGHT
In GTA 5 it is very easy to compute ex, ey, ez and to get h, w, d for the models of cars, pedestrians, and traffic signs. Therefore, it is easy to create a bounding cube around an
object. The code excerpt below details the calculation. e is the object we wish to bound
and dim is a vector of the dimensions of the model h, w, d.
Vector3[] vertices = new Vector3[8];
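// FUL and BLR are the front upper left and back lower right vertices of the
// bounding cube; dim holds the model dimensions (h, w, d) and e is the entity
// being bounded.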
vertices[0] = FUL;
vertices[1] = FUL - dim.X*e.RightVector;
vertices[2] = FUL - dim.Z*e.UpVector;
vertices[3] = FUL - dim.Y*Vector3.Cross(e.UpVector, e.RightVector);
vertices[4] = BLR;
vertices[5] = BLR + dim.X*e.RightVector;
vertices[6] = BLR + dim.Z*e.UpVector;
vertices[7] = BLR + dim.Y*Vector3.Cross(e.UpVector, e.RightVector);
There is a function called WorldToScreen which takes a 3 dimensional point in the world and computes that point's location on the screen. Unfortunately, this function returns the origin if the point is not visible on the screen. This is a problem as we want
to draw a bounding box even if part of the object is out of view, a car coming in on
the left for example. In these cases we want the bounding box to extend to the edge
of the screen. The simplest solution is to map all points to the viewing plane which is
infinite and follow the algorithm above. This requires a custom g function and a good
understanding of the camera model.
3.5.1 GTA 5 Camera Model
Let us first establish some terminology. Let e ∈ R³ be the location of the observer and let c ∈ R³ be a point on the viewing plane, the plane where the image of the world is formed, such that the vector p from e to c represents the direction the camera is pointing and is perpendicular to the viewing plane. Additionally, let θ be the rotation vector of the camera relative to the world coordinates. After a lot of experimentation, we determined that the position property of the camera object in GTA 5 refers to e. θ measures angles counterclockwise in degrees. When θ = 0, the camera faces down the positive y-axis and the viewing plane is thus the xz-plane. The order of rotation from this position is around the x-axis, then the y-axis, and then the z-axis.
3.5.2 From 3D to 2D
Based on the information about the camera model, we can take a 3 dimensional point in the world, map it to the viewing plane, and then transform it to screen pixels. Let a ∈ R³ be the point we wish to map. First we must transform this point to camera coordinates. This is accomplished by rotating a using the equations below and subtracting c (the subtraction is omitted in the equations).
Figure 5: Camera model and parameters in GTA 5

(dx, dy, dz)ᵀ = Rz(θz) Ry(θy) Rx(θx) (ax, ay, az)ᵀ

where Rx, Ry, and Rz are the standard rotation matrices about the x, y, and z axes. Expanded,

dx = cos(θz)[ax cos(θy) + sin(θy)(ay sin(θx) + az cos(θx))] − sin(θz)[ay cos(θx) − az sin(θx)]
dy = sin(θz)[ax cos(θy) + sin(θy)(ay sin(θx) + az cos(θx))] + cos(θz)[ay cos(θx) − az sin(θx)]
dz = −ax sin(θy) + cos(θy)(ay sin(θx) + az cos(θx))
We also need to rotate the vector representing the z direction in the world, vup,world, and the vector representing the x direction in the world, vx,world. We also need to compute the width and height of the region of the viewing plane which is actually displayed on screen. We call this region the view window. In the equations below, F is the field of view in radians and d_nearclip is the distance between c and e.

viewWindowHeight = 2 · d_nearclip · tan(F/2)
viewWindowWidth = (IMAGE_WIDTH / IMAGE_HEIGHT) · viewWindowHeight
We then compute the intersection point between the vector d − e and the viewing plane, call it p_plane. We translate the origin to the upper left corner of the view window and update p_plane accordingly.

newOrigin = c + (viewWindowHeight/2) · vup,camera − (viewWindowWidth/2) · vx,camera
p_plane = (p_plane + c) − newOrigin
Next we calculate the coordinates of p_plane in the two dimensions of the plane.

viewPlaneX = (p_plane · vx,camera) / (vx,camera · vx,camera)
viewPlaneZ = (p_plane · vup,camera) / (vup,camera · vup,camera)
Finally we scale the coordinates to the size of the screen. UI.WIDTH and UI.HEIGHT are in-game constants.

screenX = (viewPlaneX / viewWindowWidth) · UI.WIDTH
screenY = (−viewPlaneZ / viewWindowHeight) · UI.HEIGHT
The process is summarized below.
Algorithm 2 get2Dfrom3D: Algorithm for computing screen coordinates of a 3D point.
Require: a
1: translate and rotate a into camera coordinates point d
2: rotate vup,world, vx,world to vup,camera, vx,camera
3: compute viewWindowHeight, viewWindowWidth
4: find intersection of d − e with the viewing plane
5: translate origin of the viewing plane
6: calculate the coordinates of the intersection point in the plane
7: scale the coordinates to screen size in pixels
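To make the procedure concrete, below is a condensed, geometry-only sketch of Algorithm 2 under the assumptions stated above: the camera position e, its unit right, forward and up vectors in world coordinates (obtained by rotating the world axes as described), the field of view, and the near clip distance are known. The class, the plain double[] vectors, and the helper names are illustrative and not part of the game's API, and points behind the camera would need additional handling before clamping. The second method sketches steps 5-12 of Algorithm 1 on the projected vertices.

using System;

static class ProjectionSketch
{
    const double ImageWidth = 800, ImageHeight = 600; // stand-ins for UI.WIDTH, UI.HEIGHT

    static double Dot(double[] a, double[] b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    // Maps a world point a to screen pixels given camera position e, the camera's
    // unit right/forward/up vectors, the field of view in degrees, and the near
    // clip distance (distance from e to the viewing plane).
    public static double[] Project(double[] a, double[] e, double[] right,
                                   double[] forward, double[] up,
                                   double fovDeg, double nearClip)
    {
        // Express a - e in camera coordinates.
        double[] rel = { a[0] - e[0], a[1] - e[1], a[2] - e[2] };
        double cx = Dot(rel, right), cy = Dot(rel, forward), cz = Dot(rel, up);

        // Size of the view window on the viewing plane.
        double viewH = 2.0 * nearClip * Math.Tan(fovDeg * Math.PI / 360.0);
        double viewW = (ImageWidth / ImageHeight) * viewH;

        // Intersect the ray through a with the viewing plane and express the
        // intersection relative to the center of the view window.
        double planeX = cx * nearClip / cy;
        double planeZ = cz * nearClip / cy;

        // Shift the origin to the upper left corner and scale to pixels
        // (screen y grows downward).
        double screenX = (planeX / viewW + 0.5) * ImageWidth;
        double screenY = (0.5 - planeZ / viewH) * ImageHeight;
        return new[] { screenX, screenY };
    }

    // Enclose the projected vertices of the bounding cube and clamp to the image.
    public static double[] BoundingBox(double[][] projectedVertices)
    {
        double xmin = double.MaxValue, ymin = double.MaxValue;
        double xmax = double.MinValue, ymax = double.MinValue;
        foreach (double[] p in projectedVertices)
        {
            xmin = Math.Min(xmin, p[0]); xmax = Math.Max(xmax, p[0]);
            ymin = Math.Min(ymin, p[1]); ymax = Math.Max(ymax, p[1]);
        }
        xmin = Math.Max(0, xmin); ymin = Math.Max(0, ymin);
        xmax = Math.Min(ImageWidth, xmax); ymax = Math.Min(ImageHeight, ymax);
        return new[] { xmin, ymin, xmax, ymax };
    }
}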
3.5.3 General Approach To Annotation of Objects
The main objective is to draw bounding boxes around objects which are within a certain distance. There exist functions GetNearbyVehicles, GetNearbyPeds, and GetNearbyEntities. These functions allow us to get an array of all cars, pedestrians, and objects in an area around the test car. Each object can be tested individually to see if it is visible on the screen. We created a custom function for doing so, as the in-game function has unreliable behavior. This function works by checking whether it is possible to draw a straight line between e and at least one of the vertices of the bounding cube without hitting any other object. The name of this method is ray casting, and it will be discussed in more detail later. It must be noted that in the hierarchy of the game, pedestrians and vehicles are also entities. Therefore a filtering process is applied when bounding signs. This process is discussed in the signs section.
3.5.4 Cars
Compared to TORCS, GTA 5 has almost ten times more car models. There are 259 vehicles in GTA V (see http://www.ign.com/wikis/gta-5/Vehicles for the complete list). These vehicles come in various shapes and sizes, from golf carts to trucks and trailers. This diversity is more representative of the real distribution of vehicles and can hopefully be utilized to train more accurate neural networks. The above method can put a bounding box around any of these vehicles. Please see Figures 6, 7, and 8 for examples.
3.5.5 Pedestrians
Pedestrians can also be bounded for classification and localization training. GTA 5 has pedestrians of various genders and ethnicities. More importantly, the pedestrians in GTA 5 perform various actions such as standing, crossing streets, and sitting. This creates a lot of diversity for training. The drawback of GTA 5 is that all pedestrians are about the same height.
3.5.6 Signs
As mentioned before, signs are a bit trickier to bound. There are two reasons for this. First, the only way to find the signs which are around the test vehicle is to get all entities. This includes cars, pedestrians, and various miscellaneous props, many of which
Figure 6: Two cars bounded in boxes. Weather: rain.
Figure 7: Two cars bounded in boxes.
Figure 8: Traffic jam bounded in boxes.
Figure 9: Pedestrians bounded in boxes.
Sign Description                  DOT Id [3]
Stop Sign                         R1-1
Yield Sign                        R1-2
One Way Sign                      R6-1
No U-Turn Sign                    R3-4
Freeway Entrance                  D13-3
Do Not Enter / Wrong Way Sign     R5-1 and R5-1a
Figure 10: Some of the traffic signs present in GTA 5.
are of no interest. Thus we need to check the model of each entity to see if it is a traffic sign. To do so, we need a list of all of the models of all traffic signs in GTA 5. This list would include many of the signs listed in the Manual on Uniform Traffic Control Devices [3]. See Figure 10 for some of the signs in GTA 5.
The second difficulty with traffic signs is that they may require more than one bounding box. For example, a traffic light may have several lights on it; see Figure 12. This leads to the idea of spaces of interest (SOI). One sign model may have several spaces of interest we wish to bound.
Figure 11: Stop sign in bounding box.
Figure 12: Traffic lights in bounding boxes.
There is an elegant solution to both problems: a database of spaces of interest. Every entry contains a model hash code, the name of the sign, and the x, y, z coordinates of the front upper left and back lower right vertices of the bounding cube. With such a database, the algorithm for bounding signs is as follows:
Algorithm 3 Algorithm for bounding signs.
Require: d - database of spaces of interest
1: read in d
2: get array of entities e from GetNearbyEntities
3: for each entity in e do
4: check if the model of the entity matches any hash codes in d
5: get all the matching spaces of interest
6: for each space of interest do
7: draw a bounding box
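A minimal sketch of how such a database entry and the hash-code matching in step 4 could be represented is shown below; the class and field names are illustrative, with the fields taken from the entry description above, and the Vector3 type is the scripting library's.

using System.Collections.Generic;
using GTA.Math;

// One space of interest: a named sub-box of a sign model, stored as its front
// upper left and back lower right corners relative to the model.
class SpaceOfInterest
{
    public int ModelHash;
    public string SignName;
    public Vector3 FrontUpperLeft;
    public Vector3 BackLowerRight;
}

class SoiDatabase
{
    // Spaces of interest grouped by model hash for fast matching in step 4.
    private readonly Dictionary<int, List<SpaceOfInterest>> byHash =
        new Dictionary<int, List<SpaceOfInterest>>();

    public void Add(SpaceOfInterest soi)
    {
        List<SpaceOfInterest> list;
        if (!byHash.TryGetValue(soi.ModelHash, out list))
        {
            list = new List<SpaceOfInterest>();
            byHash[soi.ModelHash] = list;
        }
        list.Add(soi);
    }

    // Returns all spaces of interest for an entity's model hash, or an empty list.
    public List<SpaceOfInterest> Match(int modelHash)
    {
        List<SpaceOfInterest> list;
        return byHash.TryGetValue(modelHash, out list) ? list : new List<SpaceOfInterest>();
    }
}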
3.6 Pixel Maps
Pixel maps are more refined bounding boxes. Instead of marking an entity with the four corners of a box, we mark it with every pixel it occupies on the screen. This can be done easily when we start with a bounding box b = {(xmin, ymin), (xmax, ymax)} and invert the function which maps 3 dimensional points to the screen. The inverse of g can be constructed as follows. Given screenX and screenY in pixels, we transform the pixel values to coordinates on the viewing plane. Next, we transform the point on the viewing plane into a point in the 3 dimensional world, p_world.
viewPlaneX = (screenX / UI.WIDTH) · viewWindowWidth
viewPlaneZ = (−screenY / UI.HEIGHT) · viewWindowHeight
p_world = viewPlaneX · vx,camera + viewPlaneZ · vup,camera + newOrigin
Once we compute p_world, we use the Raycast function to get the entity which occupies that pixel. The Raycast function requires a point of origin, in our case e, a direction, in our case p_world − e, and a maximum distance the ray should travel, which we could set to a very large number such as 10,000. If the entity returned by Raycast matches the entity the bounding box encloses, then we add the pixel to the map.
Algorithm 4 Algorithm for computing a pixel map of an entity.
Require: entity, b = {(xmin, ymin), (xmax, ymax)}
1: let map be a boolean array of size IMAGE_WIDTH by IMAGE_HEIGHT
2: for x ∈ {xi | xi ∈ Z, xmin ≤ xi ≤ xmax} do
3: for y ∈ {yi | yi ∈ Z, ymin ≤ yi ≤ ymax} do
4: compute p_world for (x, y)
5: Raycast from e in the direction p_world − e to get entityRaycast
6: if entity = entityRaycast then
7: set map[x, y] to true
Depending on the application, these maps can be combined using the boolean OR function. The pixel map function is yet to be implemented due to time constraints. Besides being a straightforward extension of bounding boxes, it is also less useful for machine learning due to a cumbersome and perhaps unnecessarily complex representation of objects. Figures 13 and 14 show what the result of such a function would look like.
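Since the function has not been implemented, the following is only a sketch of how the inverse mapping above could look, using the same illustrative double[] convention as the projection sketch in Section 3.5.2; the ray casting step is left as a comment because the scripting library's exact Raycast signature is not reproduced here.

using System;

static class PixelMapSketch
{
    const double ImageWidth = 800, ImageHeight = 600; // stand-ins for UI.WIDTH, UI.HEIGHT

    // Inverse of the projection in Section 3.5.2: maps a screen pixel to the
    // corresponding 3D point on the viewing plane. newOrigin is the upper left
    // corner of the view window, xCam and upCam are the camera's unit x and up
    // vectors in world coordinates, and viewW/viewH are the view window size.
    public static double[] ScreenToWorld(double screenX, double screenY,
                                         double[] newOrigin, double[] xCam, double[] upCam,
                                         double viewW, double viewH)
    {
        double planeX = screenX / ImageWidth * viewW;
        double planeZ = -screenY / ImageHeight * viewH;
        return new[]
        {
            planeX * xCam[0] + planeZ * upCam[0] + newOrigin[0],
            planeX * xCam[1] + planeZ * upCam[1] + newOrigin[1],
            planeX * xCam[2] + planeZ * upCam[2] + newOrigin[2]
        };
    }
}

// For each pixel inside the bounding box, Algorithm 4 would then cast a ray from
// the camera position e through ScreenToWorld(x, y, ...) and mark the pixel only
// if the ray hits the entity being mapped.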
3.7 Road Lanes
Identifying and locating cars, pedestrians, and signs will only help with a part of the
driving task. Even without any of these things present, drivers must still stay within a
specified lane. Ultimately, locating the lanes and the vehicle’s position in them is the
foundation of the driving task. We will explore a method for extracting information
similar to [8] from GTA 5.
3.7.1 Notes on Drivers
First, let us examine how real drivers collect information on lane positions. There is ample literature on the topic. The general consensus is that humans look about 1 second ahead to locate lanes [10] [20] [19]. This time applies to speeds between 30 km/h and 60 km/h [10] [19] and corresponds to a distance of about 10 meters. In a more detailed model, human drivers have two distances at which they collect information. At 0.93 s, or 15.7 m, road curvature information is collected [19], and at 0.53 s, or 9 m, position in lane is collected [19]. Near information is used to fine-tune driving and is sufficient at low speeds [19]. At high speeds, the farther distance is used for guidance and stabilization [10]. Drivers also look about 5.5 degrees below the true horizon for road data [19]. For curves, humans use a tangent point on the inside of the curve for guidance [20]. They locate this point 1 to 2 seconds before entering the curve. [20]
Figure 13: Image with a bounding box.
Figure 14: Image with a pixel map for a car applied.
3.7.2 Indicators
From the literature on human cognition, we know where people look for information on road lanes. In [8], we find a very useful model of what information to collect. Chen et al.'s system uses 13 indicators for navigating down a highway-like racetrack. While this roadway is very simple compared to real-world roads, which have exits, entrances, shared left-turn lanes, and lane merges, the indicators are quite universal. Figure 15 lists the indicators, their descriptions, and their ranges.
3.7.3 Road Network in GTA 5
The GTA 5 road network is composed of 74,530 nodes and 77,934 links. [2] Each node has x, y, z coordinates and 19 flags, and each link consists of 2 node ids and 4 flags. [2] This information is contained in paths.ipl. Figures 16 and 17 show which flags are currently known. It does not appear that any of these flags would be particularly useful for figuring out the location of the lane markings.
The Federal Highway Administration sets the lane width for freeway lanes at 3.6 m (12 feet) and for local roads at between 2.7 m and 3.6 m; ramps are between 3.6 and 9 m (12 to 30 feet). [1] Based on our measurements, the lanes in GTA 5 are 5.6 meters wide. This should not be a problem when a trained network is applied to real world applications, since the output can always be scaled.
3.7.4 Finding the Lanes
We know what information we would like to collect, and we know that we want to collect it at a point on the road about 10 meters in front of the test car. Figure 18 represents our data collection situation. We want to compute where the lanes are at the blue line. Assuming we could locate the left, middle, and right lane markings, we could then see if there are any cars whose positions fall between these points. The cars would also have to be visible on the screen and no farther than some maximum distance. Following [8], this distance d could be 70 meters.
We can compute the indicators if we know the position of the lanes and the heading of the road. Let h be the heading vector of the road at the 10 meter mark. Let LL, ML, MR, and RR be points on the lane markings where the blue line intersects the lanes. Let f be a point on the ground at the very front of the test vehicle, possibly below the camera. We will perform the calculation for the three-lane indicators, as the two-lane indicators can then be filled in with values based on them. The angle is simply the angle between the test car heading vector, hcar, and the road heading vector.
Indicators

Indicator      Description                                               Min Value   Max Value
angle          angle between the car's heading and the tangent           -0.5        0.5
               of the road
dist L         distance to the preceding car in the left lane            0           75
dist R         distance to the preceding car in the right lane           0           75
toMarking L    distance to the left lane marking                         -7          -2.5
toMarking M    distance to the central lane marking                      -2          3.5
toMarking R    distance to the right lane marking                        2.5         7
dist LL        distance to the preceding car in the left lane            0           75
dist MM        distance to the preceding car in the current lane         0           75
dist RR        distance to the preceding car in the right lane           0           75
toMarking LL   distance to the left lane marking of the left lane        -9.5        -4
toMarking ML   distance to the left lane marking of the current lane     -5.5        -0.5
toMarking MR   distance to the right lane marking of the current lane    0.5         5.5
toMarking RR   distance to the right lane marking of the right lane      4           9.5

Figure 15: List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
Flag Meaning
0 0 (primary) or 1 (secondary or tertiary)
1 0 (land), 1 (water)
2 unknown (0 for all nodes)
3 unknown (1 for 65,802 nodes, otherwise 0, 2, or 3)
4 0 (road), 2 (unknown), 10 (pedestrian), 14 (interior), 15 (stop), 16 (stop), 17 (stop),
18 (pedestrian), 19 (restricted)
5 unknown (from 0/15 to 15/15)
6 unknown (0 for 60,111 nodes, 1,141 other values)
7 0 (road) or 1 (highway or interior)
8 0 (primary or secondary) or 1 (tertiary)
9 0 (most nodes) or 1 (some tunnels)
10 unknown (0 for all nodes)
11 0 (default) or 1 (stop - turn right)
12 0 (default) or 1 (stop - go straight)
13 0 (major) or 1 (minor)
14 0 (default) or 1 (stop - turn left)
15 unknown (1 for 10,455 nodes, otherwise 0)
16 unknown (1 for 32 nodes, otherwise 0, on highways)
17 unknown (1 for 62 nodes, otherwise 0, on highways)
18 unknown (1 for 92 nodes, otherwise 0, some turn lanes)
Figure 16: Flags for links. [2]
Flag Meaning
0 unknown (-10, -1 to 8 or 10)
1 unknown (0 to 4 or 6)
2 0 (one-way), 1 (unknown), 2 (unknown), 3 (unknown)
3 0 (unknown), 1 (unknown), 2 (unknown), 3 (unknown), 4 (unknown), 5 (unknown), 8
(lane change), 9 (lane change), 10 (street change), 17 (street change), 18 (unknown),
19 (street change)
Figure 17: Flags for nodes. [2]
angle = cos⁻¹( (h · hcar) / (||h|| ||hcar||) )
Figure 18: Blue line represents where we want to collect data on lane location.

For toMarking LL, toMarking ML, toMarking MR, and toMarking RR, we will assume that the lanes are straight lines. We have a point on each of those lines and a vector indicating the direction in which they are heading. This assumption is crude; however, at the distances we are discussing it should not produce large errors. Additionally, we could adjust the distance at which we sample data based on the road heading. This would not only be more in line with human behavior [10] [20] [19], it would also reduce errors. To compute the distance we must project the vector f − LL onto the vector −h and compute the distance between the projected point and f − LL. We will work out the mathematics for the left marking of the left lane, LL.
r = proj_(−h)(f − LL) = ( ((f − LL) · (−h)) / ||−h||² ) (−h)
toMarking LL = ||(f − LL) − r||
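A small sketch of the angle and toMarking LL computations above, using illustrative double[] vectors in place of the game's Vector3 type:

using System;

static class LaneIndicatorSketch
{
    static double Dot(double[] a, double[] b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    static double Norm(double[] a) { return Math.Sqrt(Dot(a, a)); }

    static double[] Sub(double[] a, double[] b)
    {
        return new[] { a[0] - b[0], a[1] - b[1], a[2] - b[2] };
    }

    static double[] Scale(double s, double[] a)
    {
        return new[] { s * a[0], s * a[1], s * a[2] };
    }

    // Angle between the test car heading hCar and the road heading h.
    public static double Angle(double[] h, double[] hCar)
    {
        return Math.Acos(Dot(h, hCar) / (Norm(h) * Norm(hCar)));
    }

    // Distance from the point f (front of the test car) to the lane marking
    // that passes through the point LL with direction h.
    public static double ToMarking(double[] f, double[] LL, double[] h)
    {
        double[] v = Sub(f, LL);                                            // f - LL
        double[] minusH = Scale(-1.0, h);                                   // -h
        double[] r = Scale(Dot(v, minusH) / Dot(minusH, minusH), minusH);   // proj_(-h)(f - LL)
        return Norm(Sub(v, r));                                             // ||(f - LL) - r||
    }
}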
To compute dist LL, dist MM, and dist RR, we must first figure out which vehicles are in which lanes. For all the vehicles returned by GetNearbyVehicles, we can eliminate any whose heading vector forms an angle of more than 90 degrees with the heading of the road. The position of the vehicle, p, must be within the rectangular prism formed by LL, RR, f, and f + d · h in the direction normal to the ground, which is also the up vector of the test car, vup. This can be checked by projecting LL − f, RR − f, and d · h onto the plane defined by f and vup. The following are the projections of the points.
rLL = LL − proj_vup(LL) = LL − ( (LL · vup) / ||vup||² ) vup
rRR = RR − proj_vup(RR) = RR − ( (RR · vup) / ||vup||² ) vup
r(f+d·h) = (f + d·h) − proj_vup(f + d·h) = (f + d·h) − ( ((f + d·h) · vup) / ||vup||² ) vup
rp = p − proj_vup(p) = p − ( (p · vup) / ||vup||² ) vup
Now we just have to check that the y coordinate of rp is between the y coordinates of rLL and rRR, and that the x coordinate of rp is between 0 and the x coordinate of r(f+d·h). If the vehicle satisfies these bounds, we can compute its distance to all lane markings in the same way we did for the test vehicle. We then check which marking it is closest to and assign it to that lane, or perform additional logic. Let us assume it is in the left lane. We perform the following to compute dist LL.

r = proj_hcar(p − f) = ( ((p − f) · hcar) / ||hcar||² ) hcar
dist LL = ||r||
Algorithm 5 Algorithm for computing dist LL, dist MM, and dist RR.
Require: road heading h, car heading hcar, front point f, maximum distance d, and lane marking points LL, ML, MR, RR
1: create arrays dist LLs, dist MMs, and dist RRs and add d to each
2: l is the lane of the vehicle
3: for each vehicle v returned by GetNearbyVehicles do
4: if cos⁻¹( (h · hcar) / (||h|| ||hcar||) ) < π/2 then
5: if p is in the three lanes, in front of test car, and close then
6: compute toMarking LL, toMarking ML, toMarking MR, and toMarking RR for p
7: if toMarking LL is smallest then
8: l = left lane
9: if toMarking RR is smallest then
10: l = right lane
11: if toMarking ML is smallest AND toMarking LL < toMarking MR then
12: l = left lane
13: else
14: l = middle lane
15: if toMarking MR is smallest AND toMarking RR < toMarking MR then
16: l = right lane
17: else
18: l = middle lane
19: if l = right lane then
20: add ||projhcar (p − f)|| to dist RRs
21: else if l = left lane then
22: add ||projhcar (p − f)|| to dist LLs
23: else
24: add ||projhcar (p − f)|| to dist MMs
25: dist RR = min dist RRs
26: dist LL = min dist LLs
27: dist MM = min dist MMs
To perform the above computation we need a vector representing the heading of the road and a point on each lane marking. This is where the challenge begins. We cannot use any of the functions or methods discussed for objects, because roads and lane markings are not entities. The road is part of the terrain and the lanes are a texture. Therefore, we cannot get the width of the road model or the position of a lane marking the way we obtained those properties for cars.
GTA 5 has realistic traffic. There are many AI-driven cars in the game which navigate the road network while staying in lanes. Therefore, the game engine knows the location of the lane markings. There are several functions which pertain to roads. GetStreetName returns the name of the street at a specified point in the world. IS_POINT_ON_ROAD is a native function which checks if a point is on a road. There are also several functions which deal with vehicle nodes.
Vehicle nodes appear to be the primary way the graph of the road network is represented in the game. Every vehicle node is a point at the center of the road, as seen in Figure 19. The nodes are spaced out in proportion to the curvature of the road: close together at sharp corners and farther apart on straight stretches of road. Each node has a unique id.
The main functions for working with nodes are GET_NTH_CLOSEST_VEHICLE_NODE and GET_NTH_CLOSEST_VEHICLE_NODE_ID. A way to call them in a script is shown in the code snippet below. In this code snippet, the "safe" arguments serve an unknown purpose, as do the two zeros in GET_NTH_CLOSEST_VEHICLE_NODE_ID. The i variable specifies which node in the order of proximity should be selected. There is also a function GET_VEHICLE_NODE_PROPERTIES; however, we could not find a way to get this function to work.
OutputArgument safe1 = new OutputArgument();
OutputArgument safe2 = new OutputArgument();
OutputArgument safe3 = new OutputArgument();
Vector3 midNode;
OutputArgument outPosArg = new OutputArgument();
Function.Call(Hash.GET_NTH_CLOSEST_VEHICLE_NODE,
playerPos.X, playerPos.Y, playerPos.Z, i, outPosArg, safe1, safe2, safe3);
midNode = outPosArg.GetResult<Vector3>();
int nodeId = Function.Call<int>(Hash.GET_NTH_CLOSEST_VEHICLE_NODE_ID,
playerPos.X, playerPos.Y, playerPos.Z, i, safe1, 0f, 0f);
Figure 19: Red markers represent locations of vehicle nodes.
The benefit of this system is that we can locate our car on the network by getting the closest node. Given the road heading and lane width, it is possible to compute the centers of the lanes, as seen in Figure 20. The problem is that, as far as we could find, there is no way of getting the heading of the road or the number and positions of the lanes around the node.
Figure 20: Red markers represent locations of vehicle nodes. Blue markers are extrapo-
lations of lane middles based on road heading and lane width.
A promising approach to solving this problem was road model fitting. We know that the node is at the center of the road. We do not know if it is on a lane marking or in the middle of a lane. We could assume that it is on a lane marking and then count the number of lanes on the left and right. This could be done by moving a lane width over and checking if the point is still on the road using IS_POINT_ON_ROAD and GetStreetName. We can repeat the same method under the assumption that the node is in the middle of a lane. Whichever assumption finds more lanes is the correct one, as the wrong assumption will not count the outermost lanes. This still leaves the question of finding the heading of the road and whether the node is between lanes going in opposite directions. However, there are two fundamental problems with this approach which make it useless. First, this approach assumes that the nodes are at the centers of lanes or on lane markings. Upon further exploration, we found that nodes can be on medians, as in Figure 21. This is still the center of the road, just not where we expect it. Second, IS_POINT_ON_ROAD is not a reliable indicator of whether a point is actually on a road. Sometimes it returns false for points which are clearly on the road, and sometimes it returns true for points which are on the side of the road.
Figure 21: Red markers represent locations of vehicle nodes. Blue markers are extrapo-
lations of lane middles based on road heading and lane width. The blue marker in front
of the test car represents where we want to measure lanes.
There are two solutions to this problem. The first solution is to keep hacking at the
game until we find all of this information. The information we are looking for must be
somewhere in the game because the game AI knows where to drive. It knows where the
lanes are and how to stay in them. The second solution is to build a database of nodes.
Figure 22 lists the data which would be stored in this database.
Field Meaning
nodeId The numerical id of the node.
onMarking True if the node is on a lane marking, false if it is in the middle of a lane.
oneWay True if the traffic on both sides of the node moves in the same direction.
leftStart Vector representing the point where the road begins left of the node.
leftEnd Vector representing the point where the road ends left of the node.
rightStart Vector representing the point where the road begins right of the node.
rightEnd Vector representing the point where the road ends right of the node.
heading Vector representing the heading of the road.
Figure 22: Node database entry design.
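A minimal sketch of an entry class matching Figure 22; the field names mirror the table above, the Vector3 type is the scripting library's, and the class itself is only a proposed design, not existing code.

using GTA.Math;

// One entry of the proposed node database (Figure 22).
class RoadNodeEntry
{
    public int NodeId;          // numerical id of the vehicle node
    public bool OnMarking;      // true if the node sits on a lane marking
    public bool OneWay;         // true if traffic on both sides moves in the same direction
    public Vector3 LeftStart;   // where the road begins left of the node
    public Vector3 LeftEnd;     // where the road ends left of the node
    public Vector3 RightStart;  // where the road begins right of the node
    public Vector3 RightEnd;    // where the road ends right of the node
    public Vector3 Heading;     // heading of the road at the node
}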
The problem with this method is that there are over 70,000 nodes and there does not appear to be an easy way of collecting this information. At the present moment, however, we do not see a simpler solution.
4 Towards The Ultimate AI Machine
The previous section outlined methods for getting information out of GTA 5 to create datasets. To fully utilize GTA 5, we still need to create a database of nodes and spaces of interest. Once that is done, we will move on to creating datasets and training neural networks.
The objective of harvesting this data has been emphasized as training data for neural networks. However, the ultimate goal is much grander: building a system which can master driving in GTA 5. This system would probably include several neural networks and perhaps other statistical models. For example, it may include a network for locating pedestrians, an SVM for classifying street signs, another network for recognizing traffic lights, etc. All of these components would be linked together by some master program that would construct the most likely world model based on all of these "sensors". Then another program would be responsible for driving the car. Since we can extract data from GTA 5 in real time, we can test how well this system would work in changing conditions.
In the process of building such a system, it is possible to test out some new ideas in neural networks. We would like to continue to explore curriculum learning [5] and self-paced learning [18] [16] as means of presenting examples in order of difficulty. Since these ideas have been applied to object tracking in video [28], teaching robots motor skills [17], matrix factorization [31], handwriting recognition [22], and multi-task learning [26], surpassing state-of-the-art benchmarks, we hope that they could be used to improve autonomous driving. Another interesting idea is transfer learning [25], the ability to use a network trained in one domain in another domain. This could be applied to pedestrian and sign classifiers. Lastly, we have been working on ways to use optimal learning to select the best neural network architectures. It would be interesting to try those methods in this application.
Building this system presents two major difficulties. First, both the game and the neural networks are GPU-intensive processes. Running both on a single machine would require a lot of computational power. Second, GTA 5 only works on Windows PCs, while most deep learning libraries are Linux based. Porting either application is close to infeasible. Last semester, working with Daniel Stanley and Bill Zhang, we constructed a solution for running GTA 5 with TorcsNet from [8]. The idea was to run the processes on separate machines and have them communicate via a shared folder on a local network; see Figure 23. During the tests, the amount of data transferred was small: a text file of 13 floats and a 280 by 210 png image. This setup is fast enough for the system to run at around 10 Hz.
Figure 23: GTA V Experimental Setup
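A minimal sketch of the writing side of this exchange is shown below; the shared folder path, file name, and comma-separated format are illustrative assumptions rather than the exact setup used in the tests.

using System.Globalization;
using System.IO;
using System.Linq;

static class SharedFolderLink
{
    // Hypothetical shared folder visible to both machines.
    const string SharedFolder = @"\\gta-machine\shared";

    // Write 13 floats as one comma-separated line; the machine on the other end
    // polls the folder for new files (indicators and screenshots) on its own schedule.
    public static void WriteIndicators(float[] indicators)
    {
        string line = string.Join(",",
            indicators.Select(v => v.ToString(CultureInfo.InvariantCulture)));
        File.WriteAllText(Path.Combine(SharedFolder, "indicators.txt"), line);
    }
}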
4.1 Future Research Goals
Build a database of GTA V road nodes
Build a database of GTA V road signs
Train sign classifier
Train traffic lights classifier
Compare how well GTA V trained classifier works on real datasets
Check how well the TORCS network can identify cars in GTA V
Build a robust controller in GTA V which uses all 13 indicators
Explore the effects of curriculum learning on driving performance
Explore transfer learning and optimal learning for neural networks
Test trained models in a real vehicle (PAVE)
A Screenshot Function
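// This excerpt assumes the following namespaces are imported at the top of the
// script file: System, System.Drawing, System.Drawing.Imaging, and
// System.Runtime.InteropServices.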
private struct Rect
{
public int Left;
public int Top;
public int Right;
public int Bottom;
}
[DllImport("user32.dll")]
private static extern IntPtr GetForegroundWindow();
[DllImport("user32.dll")]
private static extern IntPtr GetClientRect(IntPtr hWnd, ref Rect rect);
[DllImport("user32.dll")]
private static extern IntPtr ClientToScreen(IntPtr hWnd, ref Point point);
void screenshot(String filename)
{
//UI.Notify("Taking screenshot?");
var foregroundWindowsHandle = GetForegroundWindow();
var rect = new Rect();
GetClientRect(foregroundWindowsHandle, ref rect);
var pTL = new Point();
var pBR = new Point();
pTL.X = rect.Left;
pTL.Y = rect.Top;
pBR.X = rect.Right;
pBR.Y = rect.Bottom;
ClientToScreen(foregroundWindowsHandle, ref pTL);
ClientToScreen(foregroundWindowsHandle, ref pBR);
Rectangle bounds = new Rectangle(pTL.X, pTL.Y, rect.Right - rect.Left,
rect.Bottom - rect.Top);
using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
{
using (Graphics g = Graphics.FromImage(bitmap))
{
g.ScaleTransform(.2f, .2f);
g.CopyFromScreen(new Point(bounds.Left, bounds.Top), Point.Empty, bounds.Size);
}
Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
using (Graphics g = Graphics.FromImage(output))
{
g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
}
output.Save(filename, ImageFormat.Bmp);
}
}
References
[1] Lane width. http://safety.fhwa.dot.gov/geometric/pubs/
mitigationstrategies/chapter3/3_lanewidth.cfm. Accessed: 2016-4-29.
[2] Paths (gta v). http://gta.wikia.com/wiki/Paths_(GTA_V). Accessed: 2016-4-29.
[3] Federal Highway Administration. Manual on uniform traffic control devices. 2009.
[4] A. R. Atreya, B. C. Cattle, B. M. Collins, B. Essenburg, G. H. Franken, A. M. Saxe,
S. N. Schiffres, and A. L. Kornhauser. Prospect eleven: Princeton university’s entry
in the 2005 darpa grand challenge. Journal of Field Robotics, 23(9):745–753, 2006.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In
Proceedings of the 26th annual international conference on machine learning, pages
41–48. ACM, 2009.
[6] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner.
Torcs, the open racing car simulator. http://www.torcs.org, 2014.
[7] S. M. Bileschi. StreetScenes: Towards scene understanding in still images. PhD
thesis, Citeseer, 2006.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for
direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
[9] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark.
In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, pages 304–311. IEEE, 2009.
[10] E. Donges. A two-level model of driver steering behavior. Human Factors: The
Journal of the Human Factors and Ergonomics Society, 20(6):691–707, 1978.
[11] F. Flohr, D. M. Gavrila, et al. Pedcut: an iterative framework for pedestrian seg-
mentation combining shape models and multiple data cues. 2013.
[12] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation
benchmark for road detection algorithms. In International Conference on Intelligent
Transportation Systems (ITSC), 2013.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti
vision benchmark suite. In Conference on Computer Vision and Pattern Recognition
(CVPR), 2012.
[14] E. Guizzo. How Google's self-driving car works. IEEE Spectrum Online, October 18,
2011.
[15] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic
signs in real-world images: The German Traffic Sign Detection Benchmark. In
International Joint Conference on Neural Networks, number 1288, 2013.
[16] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning
with diversity. In Advances in Neural Information Processing Systems, pages 2078–
2086, 2014.
[17] A. Karpathy and M. Van De Panne. Curriculum learning for motor skills. In
Advances in Artificial Intelligence, pages 325–330. Springer, 2012.
[18] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable
models. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1189–1197.
Curran Associates, Inc., 2010.
[19] M. Land, J. Horwood, et al. Which parts of the road guide steering? Nature,
377(6547):339–340, 1995.
[20] M. F. Land and D. N. Lee. Where we look when we steer. Nature, 1994.
[21] F. Larsson and M. Felsberg. Using fourier descriptors and spatial models for traffic
sign recognition. In Image Analysis, pages 238–249. Springer, 2011.
[22] J. Louradour and C. Kermorvant. Curriculum learning for handwritten text line
recognition. In Document Analysis Systems (DAS), 2014 11th IAPR International
Workshop on, pages 56–60. IEEE, 2014.
[23] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition – how
far are we from the solution? In Neural Networks (IJCNN), The 2013 International
Joint Conference on, pages 1–8. IEEE, 2013.
[24] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detec-
tion and analysis for intelligent driver assistance systems: Perspectives and survey.
Intelligent Transportation Systems, IEEE Transactions on, 13(4):1484–1497, 2012.
[25] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engi-
neering, IEEE Transactions on, 22(10):1345–1359, 2010.
[26] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple
tasks. arXiv preprint arXiv:1412.1353, 2014.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by
back-propagating errors. Cognitive modeling, 5(3):1, 1988.
[28] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Com-
puter Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages
2379–2386. IEEE, 2013.
[29] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong,
J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the darpa
grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
[30] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier. Evaluation of road mark-
ing feature extraction. In Proceedings of 11th IEEE Conference on Intelli-
gent Transportation Systems (ITSC’08), pages 174–181, Beijing, China, 2008.
http://perso.lcpc.fr/tarel.jean-philippe/publis/itsc08.html.
[31] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced
learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial
Intelligence, 2015.
46

More Related Content

Viewers also liked (15)

Exponentes
ExponentesExponentes
Exponentes
 
Bonetech Profile
Bonetech ProfileBonetech Profile
Bonetech Profile
 
CHAMELEON BUNDLE (1)
CHAMELEON BUNDLE (1)CHAMELEON BUNDLE (1)
CHAMELEON BUNDLE (1)
 
Love the Sinner
Love the SinnerLove the Sinner
Love the Sinner
 
Hmwssb bill payment
Hmwssb bill paymentHmwssb bill payment
Hmwssb bill payment
 
Reglamento campeonatossplf1 2016
Reglamento campeonatossplf1 2016Reglamento campeonatossplf1 2016
Reglamento campeonatossplf1 2016
 
Esmeraldas
EsmeraldasEsmeraldas
Esmeraldas
 
Còmo acreditar los derechos de autor
Còmo acreditar los derechos de autorCòmo acreditar los derechos de autor
Còmo acreditar los derechos de autor
 
URBAN ERASMUS TRAIL BLOCK 1
URBAN ERASMUS TRAIL BLOCK 1URBAN ERASMUS TRAIL BLOCK 1
URBAN ERASMUS TRAIL BLOCK 1
 
historia
historia historia
historia
 
Urgencias Médicas
Urgencias Médicas Urgencias Médicas
Urgencias Médicas
 
Obsługa klienta w social media - Bądź jeden level ponad normą!
Obsługa klienta w social media - Bądź jeden level ponad normą!Obsługa klienta w social media - Bądź jeden level ponad normą!
Obsługa klienta w social media - Bądź jeden level ponad normą!
 
Unidad didactica Matemáticas 2do año Secundaria
Unidad didactica Matemáticas 2do año SecundariaUnidad didactica Matemáticas 2do año Secundaria
Unidad didactica Matemáticas 2do año Secundaria
 
Demo i punkt
Demo i punktDemo i punkt
Demo i punkt
 
Sistemas de rep pp
Sistemas de rep ppSistemas de rep pp
Sistemas de rep pp
 

Similar to Video Games for Autonomous Driving

Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Mayur Wadekar
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)Øystein Øihusom
 
dissertation_hrncir_2016_final
dissertation_hrncir_2016_finaldissertation_hrncir_2016_final
dissertation_hrncir_2016_finalJan Hrnčíř
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Ed Kelley
 
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...TanuAgrawal27
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Artur Filipowicz
 
2000402 en juniper good
2000402 en juniper good2000402 en juniper good
2000402 en juniper goodAchint Saraf
 
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfIMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfvenkatesh231416
 
Autonomous cargo transporter report
Autonomous cargo transporter reportAutonomous cargo transporter report
Autonomous cargo transporter reportMuireannSpain
 
gps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminargps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminarhiharshal277
 
UIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedUIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedMichael Bailey
 

Similar to Video Games for Autonomous Driving (20)

Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.
 
Thesis Report
Thesis ReportThesis Report
Thesis Report
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
 
dissertation_hrncir_2016_final
dissertation_hrncir_2016_finaldissertation_hrncir_2016_final
dissertation_hrncir_2016_final
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
 
Vivarana fyp report
Vivarana fyp reportVivarana fyp report
Vivarana fyp report
 
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
 
LC_Thesis_Final (1).pdf
LC_Thesis_Final (1).pdfLC_Thesis_Final (1).pdf
LC_Thesis_Final (1).pdf
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
2000402 en juniper good
2000402 en juniper good2000402 en juniper good
2000402 en juniper good
 
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfIMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
 
Autonomous cargo transporter report
Autonomous cargo transporter reportAutonomous cargo transporter report
Autonomous cargo transporter report
 
T401
T401T401
T401
 
Report_Jeremy_Berard
Report_Jeremy_BerardReport_Jeremy_Berard
Report_Jeremy_Berard
 
Honours_Thesis2015_final
Honours_Thesis2015_finalHonours_Thesis2015_final
Honours_Thesis2015_final
 
gps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminargps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminar
 
final_report
final_reportfinal_report
final_report
 
vanet_report
vanet_reportvanet_report
vanet_report
 
UIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedUIC Systems Engineering Report-signed
UIC Systems Engineering Report-signed
 

More from Artur Filipowicz

Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Artur Filipowicz
 
Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Artur Filipowicz
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...Artur Filipowicz
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Artur Filipowicz
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Artur Filipowicz
 
Direct Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowDirect Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowArtur Filipowicz
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Artur Filipowicz
 
Filtering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionFiltering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionArtur Filipowicz
 
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningDesensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningArtur Filipowicz
 

More from Artur Filipowicz (9)

Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)
 
Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
Direct Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowDirect Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlow
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Filtering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionFiltering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial Recognition
 
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningDesensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
 

Video Games for Autonomous Driving

  • 1. Driving School II Video Games for Autonomous Driving Independent Work Artur Filipowicz ORFE Class of 2017 Advisor Professor Alain Kornhauser arturf@princeton.edu May 3, 2016 Revised August 27, 2016 1
  • 2. Abstract We present a method for generating datasets to train neural networks and other statistical models to drive vehicles. In [8], Chen et al. used a racing simulator called Torcs to generate a dataset of driving scenes which they then used to train a neural network. One limitation of Torcs is a lack of realism. The graphics are plain and the only roadways are racetracks, which means there are no intersections, pedestrian crossings, etc. In this paper we employ a game call Grand Theft Auto 5 (GTA 5). This game features realistic graphics and a complex transportation system of roads, highways, ramps, intersections, traffic, pedestrians, railroad crossings, and tunnels. Unlike Torcs, GTA 5 has more car models, urban, suburban, and rural environments, and control over weather and time. With the control of time and weather, GTA 5 has an edge over conventional methods of collecting datasets as well. We present methods for extracting three particular features. We create a function for generating bounding boxes around cars, pedestrians and traffic signs. We also present a method for generating pixel maps for objects in GTA 5. Lastly, we develop a way to compute distances to lane markings and other indicators from [8] 2
  • 3. Acknowledgments I would like to thank Professor Alain L. Kornhauser for his mentorship during this project and Daniel Stanley and Bill Zhang for their help over the summer and last semester. This paper represents my own work in accordance with University regulations. Artur Filipowicz 3
  • 4. Contents 1 From The Driving Task to Machine Learning 6 1.1 The Driving Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 The World Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Datasets for the Driving Task 10 2.1 Cars, Pedestrians, and Cyclysis . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Observations on Current Datasets . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Video Games and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Sampling from GTA 5 12 3.1 GTA 5 Scripts Development . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Test Car . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 Desired Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4 Screenshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5 Bounding Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5.1 GTA 5 Camera Model . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5.2 From 3D to 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5.3 General Approach To Annotation of Objects . . . . . . . . . . . . 20 3.5.4 Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.5 Pedestrians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.6 Signs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.6 Pixel Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.7 Road Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7.1 Notes on Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7.2 Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.7.3 Road Network in GTA 5 . . . . . . . . . . . . . . . . . . . . . . . 29 3.7.4 Finding the Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Towards The Ultimate AI Machine 40 4.1 Future Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 A Screenshot Function 42 List of Figures 1 Graphics and roads in Torcs. . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Graphics and roads in GTA 5. . . . . . . . . . . . . . . . . . . . . . . . . 12 4
  • 5. 3 Test Vehicle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 The red dot represents camera location. . . . . . . . . . . . . . . . . . . . 15 5 Camera model and parameters in GTA 5 . . . . . . . . . . . . . . . . . . 18 6 Two cars bounded in boxes. Weather: rain. . . . . . . . . . . . . . . . . 21 7 Two cars bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . . 21 8 Traffic jam bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . 22 9 Pedestrians bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . 22 10 Some of the traffic signs present in GTA 5. . . . . . . . . . . . . . . . . . 23 11 Stop sign in bounding box. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 12 Traffic lights in bounding boxes. . . . . . . . . . . . . . . . . . . . . . . . 24 13 Image with a bounding box. . . . . . . . . . . . . . . . . . . . . . . . . . 27 14 Image with a pixel map for a car applied. . . . . . . . . . . . . . . . . . . 28 15 List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8]. . . . . . . . . . . . . 30 16 Flags for links. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 17 Flags for nodes. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 18 Blue line represents where we want to collect data on lane location. . . . 32 19 Red markers represent locations of vehicle nodes. . . . . . . . . . . . . . 36 20 Red markers represent locations of vehicle nodes. Blue markers are ex- trapolations of lane middles based on road heading and lane width. . . . 37 21 Red markers represent locations of vehicle nodes. Blue markers are ex- trapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 22 Node database entry design. . . . . . . . . . . . . . . . . . . . . . . . . . 39 23 GTA V Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 41 5
  • 6. 1 From The Driving Task to Machine Learning 1.1 The Driving Task The driving task is a physics problem of moving an object from point a ∈ R4 to b ∈ R4 , with time being the fourth dimension, without colliding with any other object. There are also additional constraints in the form of lane markings, speed limits, and traffic flow directions. Even with all constraints beyond avoiding collisions, the physical problem of finding a navigable path is easy given a model of the world. That is, if the location of all objects and their shapes is known with certainty and the location of the constraints is known, then the task becomes first the computation of a path in a digraph G representing the road network and then for each edge finding unoccupied space and moving the object into it. All of these problems can be solved using fundamental physics and computer science. What makes the driving task difficult in the real world setting is the lack of an accurate world model. In reality we do not have omniscient drivers. 1.2 The World Model People drive, and so do computers to a limited extent. Therefore, omniscience is not necessary. Some subset of the total world model is good enough to perform the driving task. Perhaps with limited knowledge, it is only possible to successful complete the task with a probability less than 1, but the success rate is high enough for people to utilize this form of transport. To drive, we still need a world model. This model is constructed by the means of sensor fusion, the combination of information from several different sensors. In 2005, Princeton University’s entry in the DARPA Challenge, Prospect 11, used radar and cameras to identify and locate obstacles. Based on these measurements and GPS data, the on-board computer would create a world model and find a safe path. [4] In a similar approach, the Google Car uses radar and lidar to map the world around it. [14] Approaches in [4], [14], and [29] appear rather cumbersome and convoluted compared to the human way of creating a world model. Humans have 5 sensors, the eyes, the nose, the ears, the mouth, and the skin. In driving neither taste nor smell nor touch are used to build the world model as all of these senses are mostly cut off from the world outside the vehicle. The driver can hear noises from the outside. However, they can be muffled by the sound of the driver’s own vehicle and many important objects, such as street signs and lane markings, do not make noise. To construct the world model humans predominantly use one sensor, the eyes. We can suspect that there is enough information encoded in visible light coming through the front windshield to build a world model good enough for completing the driving task. However, research on autonomous vehicles - the construction of solution to the driving task using artificial intelligence - stays away from 6
  • 7. approaching the problem in the pure vision way, as noted in [4] and [29]. The reason for this is that vision, computer vision in particular, is difficult. 1.3 Computer Vision Let X ∈ Rh∗w∗c be an image of width w and height h and c colors. As we stated earlier, X has enough information for a human to figure out where lane markings and other vehicles are, identify and classify road signs and perform other measurements to build a world model. Perhaps, maybe several images in a sequence are necessary, although [8] shows that one image can be used to extract a lot of information. The difficult of computer vision is that X is a matrix of numbers representing colors of pixels. In this representation an object can appear very different depending on lighting conditions. Additionally, due to perspective, objects appear in different sizes and therefore occupy different number of pixels, even if the object is the same. These are two of many variations which humans can account for, but naive machine approaches fail. Computer vision is difficult but not impossible. In recent decades, researches used ma- chine learning to enable computers to take X and construct more salient representations. 1.4 Machine Learning The learning task is as follows; given some image Xi, we wish to predict a vector of indicators Yi. Yi could be distances to lane markings, vehicles, locations of street sings etc. and can then be used to construct a world model. To that end, we want to train a function f such that Yi = f(Xi). We say that Xi, Yi ∼ PX,Y . The machine learning approach to this problem mimics humans in more then just the focus on visual information. The primary method of learning images is the use of neural networks, more specifically convolutional neural networks. These statistical models are inspired by neurons which make up the nerves and brain areas responsible for vision. The mathematical abstraction is represented as follows: Let f(xi, W) be a neural network of L hidden layers. The sizes of these layers are l1 to lL. f(xi, W) = gL(W(L)...g3(W(3)g2(W(2)g1(W(1)xi)))...) W = {W(1), W(2), ...W(L)} W(i) ∈ Rli+1× li gi(x) is some activation function. 7
  • 8. The process of training becomes the process of adjusting values of W(i). This first requires some loss function which expresses the error made by the network. A common loss function is L2. We wish to create a neural network model f such that min f L2(T , f) where let D be a dataset of n indicator Yi and image Xi pairs D = {(Xi, Yi)}n i=1 and let R be the training set and let T be the test set. R ⊂ D T ⊂ D R ∩ T = Ø R ∪ T = D |R| = r |T | = t To minimize the loss function with respect to W, the most common method is the use of Back-Propagation Algorithm [27]. Back-Propagation Algorithm uses stochastic gradient decent to find a local minimum of a function. At each iteration j of J, Back- Propagation Algorithm updates W Wj+1 = Wj − η ∂E(W) ∂Wj The two sources of randomness in the algorithm are W0 and the order in which training examples are used π. The initial values of element in matrices in W0 are uniform random variable. The ordering of examples is also often a random sample with replacement of J (xi, yi) ∈ R On an intuitive level, the network adjusts W to extract useful features from the image pixel values Xi. In the process it builds the distribution Xi, Yi ∼ PX,Y . In theory the larger the W the more capacity the network has for extracting and leaning features and representing complex distributions. At the same time, it is also more likely to fit noise in the data a nonsalient features such as clouds. This overfitting causes poor generalization and we need a network which can generalize to many driving scenes. There are several regularization techniques to overcome overfitting. These include L1, L2, dropout, and others. However, these will only be effective if we do have the data adequately represent the domain of PX,Y . This domain for driving scenes is huge considering it includes 8
  • 9. images of all the different kinds of roads, vehicles, pedestrians, street signs, traffic lights, intersections, ramps, lane marking, lighting conditions, weather conditions, times of day and positions of the camera. [8] tested a network in a limited subset of these conditions and they used almost half million images for training. 9
  • 10. 2 Datasets for the Driving Task Machine learning for autonomous vehicles has been studied for years. Therefore, several datasets already exist. These datasets come in two types. There are datasets of objects of interest in the driving scenes which include vehicles (cars, vans, trucks), cyclists, pedestrians, traffic lights, lane markings and street signs. Usually, these datasets provide coordinates of bounding boxes around the objects Yi. These are useful for training localization and classification models. The second type of datasets provide distances to lane markings, cars and pedestrians. These are used to train regression models. Here we will give a brief overview of several of these datasets. 2.1 Cars, Pedestrians, and Cyclysis Daimler Pedestrian Segmentation Benchmark Dataset contains 785 images of pedestri- ans in an urban environment captured by a calibrated stereo camera. The groundtruth consists of true pixel shape and disparity map. [11] CBCL StreetScenes Challenge Framework contains 3,547 images of driving scenes cap- tured with bounding boxes for 5,799 cars, 1,449 pedestrians, 209 cyclists, as well as buildings, roads, sidewalks, stores, tree, and the sky. The images have been captured by photographers from street, crosswalk, and sidewalk views. [7] KITTI Object Detection Evaluation 2012 contains 7481 training images and 7518 test images with each image containing several objects. The total number of objects is 80,256, including cars, pedestrians, and cyclists. The groundtruth includes a bounding box for the object as well as an estimate of the orientation in the bird’s eye view. [13] Caltech Pedestrian Detection Benchmark contains 10 hours of driving in an urban environment. The groundtruth contains 350,000 bounding boxes for 2300 unique pedes- trians. [9] There are several datasets for street signs [23], [21], and [15]. However, these datasets have been made in European countries and therefore they contain European signs which are very different from their US counterparts. Luckily [24] is a dataset of 6,610 images containing 47 different US road signs. For each sign the annotation includes sign type, position, size, occluded (yes/no), and on side road (yes/no). 2.2 Lanes KITTI Road/Lane Detection Evaluation 2013 has 289 training and 290 test images of road lanes with groundtruth consisting of pixels map the road area and the lane the 10
  • 11. vehicle is in. The dataset contains images from three environment urban with unmarked lanes, urban with marked lanes and urban with multiple marked lanes. [12] ROMA lane database has 116 images of different roads with groundtruth pixel positions of visible lane markings. The camera calibration specifies the pixel distance to true horizon and conversions between pixel distances and meters. [30] 2.3 Observations on Current Datasets The above datasets are quite limited. First, most of them are small when compared to the half a million images used in [8]. Second, they do not represent many of the driving conditions such as different weather conditions or times of day; the reason for this is that measuring equipment, especially cameras, can only function in certain conditions. Since this tends to be sunny weather, most of these datasets are collected during such times. Additionally, all of these datasets include some amount of manual labeling which is not feasible when the dataset includes millions of images. 2.4 Video Games and Datasets The problems associated with the datasets would be resolved if we could somehow sample from PX,Y both Xi and Yi without having to spend time to measure Yi. This is not possible in the real world. However, [8] decided to use a virtual world, a racing video game called Torcs [6]. The hope behind this approach is that the game can simulate PX,Y well enough so that the network, once trained, will be able to generalize to the real world. Let us assume that this is true. The main benefit of using Torcs and other video games is access to the game engine. This allows us to extract the true Yi for each Xi we harvest from the screen. Torcs itself has several restrictions which limit it from simulating the range of driving conditions present in the real world. Fundamentally it is a racing game with circular, one-way tracks. The weather and lighting conditions are fixed. The textures are rather simple and thus unrealistic. To overcome these limitations and allow for a more diverse and realistic dataset, we focus on the game called Grand Theft Auto 5 (GTA5). Unlike Torcs, the makers of GTA5 had the funds to create a very realistic world since they were developing a commercial product and not an open-source research tool. GTA5 has hundreds of different vehicles, pedestrians, freeways, intersections, traffic signs, traffic lights, rich textures, and many other elements which create a realistic environment. Additionally, GTA5 has about 14 weather conditions and simulates lighting conditions for 24 hours of the day. To tap into these features, the next section examines ways of extracting various data. 11
  • 12. Figure 1: Graphics and roads in Torcs. Figure 2: Graphics and roads in GTA 5. 3 Sampling from GTA 5 3.1 GTA 5 Scripts Development GTA 5 is a closed source game. There is no out-of-the-box access to the underlying game engine. However, due to the game’s popularity, fans have hacked into it and developed a library of functions for interacting with the game engine. This is done by the use of scripts loaded into the game. The objective of this paper is not to give tutorial on coding scripts for GTA 5, and as such we will keep the discussion of code to a minimum. However, we will explain some of the code and game dynamics for the purpose of reproducibility and presentation of the methods used to extract data. Two tools are needed to write scripts for GTA 5. The first tool is ScritHook by Alexan- der Blade. This tool can be downloaded from: https://www.gta5-mods.com/tools/script- hook-v or http://www.dev-c.com/gtav/scripthookv/. It comes with a useful trainer which provides basic control over many game variables including weather and time. The next tool is a library called Script Hook V .Net by Patrick Mours which allows us to use C# and other .Net languages to write scripts for GTA 5. The library can be downloaded from https://www.gta5-mods.com/tools/scripthookv-net. For full source code and list of functions please see https://github.com/crosire/scripthookvdotnet. 3.2 Test Car To make the data collection more realistic we will use an in-game vehicle, the test car, with a mounted camera; similar to [13]. The vehicle model for the test car was picked arbitrarily and can be replaced with any other model. Besides the steering controls, we introduce 3 new functions for the following keys: NumPad0, ”I”, and ”O”. NumPad0 spawns a new instance of our test car. ”I” mounts the rendering camera on the test car. 12
  • 13. Figure 3: Test Vehicle ”O” restores the control of the rendering camera back to the original state. Let us look at the some of the code for the test car. The TestVehicle() function is a constructor for the TestVehicle class. It is called once when all of the scripts are loaded. This occurs at the start of the game and can be triggered at any point in the game by hitting the ”insert” key. This constructor gains control of the camera which is rendering the game by destroying all cameras and creating a new rendering camera. The function responsible for this is World.CreateCamera. The first two arguments represent position and rotation. The last argument is the field of view in degrees. We set it to 50, however this could be changed to fit the parameters of a real world camera. It is important to note GTA.Native.Function.Call. GTA 5’s game engine has thousands of native functions which were used by the developers to build the game. This library encapsulates some of them. Others can be called using GTA.Native.Function.Call where the first argument is the hash code of the function to call and the remaining arguments are the arguments to pass to the native function. One of the biggest challenges in this project is figuring out what these other arguments represent and control. There are 13
  • 14. online databases where players of the game list known functions and parameters. These databases are far from complete. Therefore, for some of these native function calls, some of the arguments may not have any justification besides that they make the function work. This is the price paid for using a closed source game. public TestVehicle() { UI.Notify("Loaded TestVehicle.cs"); // create a new camera World.DestroyAllCameras(); camera = World.CreateCamera(new Vector3(), new Vector3(), 50); camera.IsActive = true; GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true, camera.Handle, true, true); // attach time methods Tick += OnTick; KeyUp += onKeyUp; } The camera position and rotation do not matter in the previous function as they will be dynamically updated to keep up with the position and rotation of the car. This is accomplished by updating both properties at everything tick of the game. A tick is a periodic call of the OnTick function. On each tick, we will keep the camera following the car by setting its rotation and position to be that of the test car. The position of the camera is offset by 2 meters forward and 0.4 meters up relative to the center of the test car. This places the camera on the center of the hood of the car as seen in Figure 4. // Function used to keep camera on vehicle and facing forward on each tick step. public void keepCameraOnVehicle() { if (Game.Player.Character.IsInVehicle()) { // keep the camera in the same position relative to the car camera.AttachTo(Game.Player.Character.CurrentVehicle, new Vector3(0f, 2f, 0.4f)); // rotate the camera to face the same direction as the car camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation; } } 14
  • 15. Figure 4: The red dot represents camera location. void OnTick(object sender, EventArgs e) { keepCameraOnVehicle(); } 3.3 Desired Functions Being inside the game with our test vehicle, we want to collect training data. Existing datasets provide good inspiration for what should be collected. A common datum is the coordinates of bounding boxes for objects such as cars as in [7], [9] and [13] and traffic signs as in [23], [21], [15] and [24]. Pixel maps representing areas in the image where cer- tain objects are also common. ROMA [30] has pixel of lane marking. KITTI Road/Lane Detection Evaluation 2013 [12] has pixel of road areas marked. Daimler Pedestrian Seg- mentation Benchmark Dataset [11] has pixel of pedestrians marked. Lastly, we would like to make measurements of distances to lanes and cars in a framework from [8]. The following sections describe ways of collecting the above information for X, Y data pairs. 15
  • 16. 3.4 Screenshots To collect X, we take a screen shot of the game. GTA 5 runs only on Windows. Using Windows user32.dll functions GetForegroundWindow, GetClientRect, and Client- ToScreen, we can extract the exact area of the screen where the game appears. Neural networks take small, usually 100 pixels by 200 pixels, images as input, we set the game resolution to be as small as possible and let h = IMAGE HEIGHT = 600 pixels and w = IMAGE WIDTH = 800 pixels. These could be furthered scaled down to fit a particular model such as [8]. For implementation please see Appendix A. 3.5 Bounding Boxes A bounding box is a pair of points which defines a rectangle which encompasses an object in an image. Let b = {(xmin, ymin), (xmax, ymax)} be a bounding box, where xmin, ymin, xmax, ymax are coordinates in an image in pixels with the upper left corner being the origin. The task of creating bounding boxes includes computing the extremes of a 3 dimensional object and enclosing them in a rectangle. The algorithm for doing this is very simple. Algorithm 1 Algorithm for computing a bounding box. Require: Model m and center c of an object 1: get the dimensions of m → (h, w, d) 2: compute unit vectors with respect to the object (ex, ey, ez) 3: using ex, ey, ez and h, w, d compute the set of vertices v of a cube enclosing the object 4: map each point p ∈ v to the viewing plane using g : R3 → R3 to create set z 5: xmin = min x z 6: xmax = max x z 7: ymin = min x z 8: ymax = max y z 9: if xmin < 0 then xmin = 0 10: if xmax > IMAGE WIDTH then xmax = IMAGE WIDTH 11: if ymin < 0 then ymin = 0 12: if xmax > IMAGE HEIGHT then ymax = IMAGE HEIGHT In GTA 5 it is very easy to compute ex, ey, ez and get h, w, d for models for cars, pedestrians, and traffic signs. Therefore, it is easy to create a bounding cube around an object. The code excerpt below details the calculation. e is the object we wish to bound and dim is a vector of the dimensions of the model h, w, d. Vector3[] vertices = new Vector3[8]; 16
  • 17. vertices[0] = FUL; vertices[1] = FUL - dim.X*e.RightVector; vertices[2] = FUL - dim.Z*e.UpVector; vertices[3] = FUL - dim.Y*Vector3.Cross(e.UpVector, e.RightVector); vertices[4] = BLR; vertices[5] = BLR + dim.X*e.RightVector; vertices[6] = BLR + dim.Z*e.UpVector; vertices[7] = BLR + dim.Y*Vector3.Cross(e.UpVector, e.RightVector); There is a function called WorldToScreen which takes a 3 dimensional point in the world and computes that points location on the screen. Unfortunately, this function returns the origin if a point is not visible on the screen. This is a problem as we want to draw a bounding box even if part of the object is out of view, a car coming in on the left for example. In these cases we want the bounding box to extend to the edge of the screen. The simplest solution is to map all points to the viewing plane which is infinite and follow the algorithm above. This requires a custom g function and a good understanding of the camera model. 3.5.1 GTA 5 Camera Model Let’s first establish some terminology. Let e ∈ R3 be the location of the observer and let c ∈ R3 be a point on the viewing plane, the plane where the image of the world is formed, such that vector p from e to c represents the direction the camera is pointing and is perpendicular to the viewing plane. Additionally, let θ be a rotation vector of the camera relative to the world coordinates. After a lot of experimentation, we determined that the position property of the camera object in GTA 5 refers to e. θ measures angles counterclockwise in degrees. When θ = 0, the camera is facing down the positive y-axis and the view plane is thus the xz-plane. The order of rotation from this position is around x-axis then y-axis and then z-axis. 3.5.2 From 3D to 2D Based on the information about the camera model, we can take a 3 dimensional point in the world and then map it to the viewing plane and then transform it to screen pixels. Let a ∈ R3 be the point we wish to map. First we must transform this point to the camera coordinates. This is accomplished by rotating a using the equations below and subtracting c, the subtraction is omitted. 17
  • 18. Figure 5: Camera model and parameters in GTA 5   dx dy dz   =   cos(θx) −sin(θx) 0 sin(θx) cos(θx) 0 0 0 1     cos(θy) 0 sin(θy) 0 1 0 −sin(θy) 0 cos(θy)     1 0 0 0 cos(θx) −sin(θx) 0 sin(θx) cos(θx)     ax ay az   dx = cos(θz)[axcos(θy) + sin(θy)[aysin(θx) + azcos(θx)]] − sin(θz)[aycos(θx) − azsin(θx)] dy = sin(θz)[axcos(θy) + sin(θy)[aysin(θx) + azcos(θx)]] + cos(θz)[aycos(θx) − azsin(θx)] dz = −axsin(θy) + cos(θy)[aysin(θx) + azcos(θx)] We also need to rotate the vector representing the z direction in the world, vup,world and the vector representing the x direction in the world, vx,world. We also need to compute the width and hight of the region of the view plane which is actually displayed on screen. We call this region the view window. In the equations below F is the field of view in radians and dnear clip is the distance between c and e. viewWindowHeight = 2 ∗ dnear cliptan(F/2) viewWindowWidth = IMAGE WIDTH IMAGE HEIGHT ∗ viewWindowHeight 18
  • 19. We then compute the intersection point between vector d − e and the viewing plane, call it pplane. We translate the origin to the upper left corner of the view window and update pplane to pplane. newOrigin = c + viewWindowHeight 2 ∗ vup,camera − viewWindowWidth 2 ∗ vx,camera pplane = (pplane + c) − newOrigin Next we calculate the coordinates of pplane in the two dimensions of the plane. viewPlaneX = p T planevx,camera vT x,cameravx,camera viewPlaneZ = p T planevup,camera vT up,cameravup,camera Finally we scale the coordinates to the size of the screen. UI.WIDTH and UI.HEIGHT are in-game constants. screenX = viewPlaneX viewWindowWidth ∗ UI.WIDTH screenY = −viewPlaneZ viewWindowHeight ∗ UI.HEIGHT The process is summarized below. Algorithm 2 get2Dfrom3D: Algorithm for computing screen coordinates of a 3D point. Require: a 1: translate and rotate a into camera coordinates point d 2: rotate vup,world, vx,world to vup,camera, vx,camera 3: compute viewWindowHeight, viewWindowWidth 4: find intersection of d − e with the viewing plane 5: translate origin of the viewing plane 6: calculate the coordinates of the intersection point in the plane 7: scale the coordinates to screen size in pixels 19
  • 20. 3.5.3 General Approach To Annotation of Objects The main objective is to draw bounding boxes around objects which are within a certain distance. There exist functions GetNearbyVehicles, GetNearbyPeds, and Get- NearbyEntities. These functions allows us to get an array of all cars, pedestrians and objects in an area around the test car. Each object can be tested individually to see if it is visible on the screen. We created a custom function for doing so as the in game function has unreliable behavior. This function works by checking if it is possible to draw a strait line between e and at least one of the vertices of the bounding cube without hitting any other object. The name of this methods is ray casting and it will be discussed in more detail later. It must be noted that in the hierarchy of the game, pedestrians and vehicles are also entities. Therefore a filtering process is applied when bounding signs. This process is discussed in the signs section. 3.5.4 Cars Compared to TORCS, GTA 5 has almost ten time more car models. There are 259 vehicles in GTA V (See http://www.ign.com/wikis/gta-5/Vehicles for the complete list). There vehicles are of various shapes and sizes, from golf carts to truck and trailers. This diversity is more representative of the real distribution of vehicles and can hopefully be utilized to train more accurate neural networks. The above method can put a bounding box around any of these vehicles. Please see Figures 6, 7, and 8 for examples. 3.5.5 Pedestrians Pedestrians can also be bounded for classification and localization training. GTA 5 has pedestrians of various genders and ethnicities. More importantly, the pedestrians in GTA 5 perform various actions like standing, crossing streets, sitting etc. This creates a lot of diversity for training. The draw back of GTA 5 is that all pedestrians are about the same height. 3.5.6 Signs As mentioned before, signs are a bit more tricky to bound. There are two reasons for this. First, the only way to find get signs which are around the test vehicle is to get all entities. This includes cars, pedestrians, and various miscellaneous props, many of which 20
  • 21. Figure 6: Two cars bounded in boxes. Weather: rain. Figure 7: Two cars bounded in boxes. 21
  • 22. Figure 8: Traffic jam bounded in boxes. Figure 9: Pedestrians bounded in boxes. 22
  • 23. Sign Description DOT Id [3] GTA Picture Stop Sign R1-1 Yield Sign R1-2 One Way Sign R6-1 No U-Turn Sign R3-4 Freeway Entrance D13-3 Do Not Enter Wrong Way Sign R5-1 and R5-1a Figure 10: Some of the traffic signs present in GTA 5. are of no interest. Thus we need to check each entity for its model to see if it is a traffic sign. To do so, we need a list of all of the models of all traffic signs in GTA 5. This list would include many of the signs listed in Uniform Traffic Control Devices [3]. See Figure 10 for some of the signs in GTA 5. The second difficulty with traffic signs is that they may require more than one bounding box. For example, a traffic light may have several lights on it, see figure 12. This leads to the idea of spaces of interest, or SOP. One sign model may have several space of interest we wish to bound. 23
  • 24. Figure 11: Stop sign in bounding box. Figure 12: Traffic lights in bounding boxes. 24
  • 25. There is an elegant solution to both problems. The solution is a database of spaces of interest. Every entry contains a model hash code, name of the sign, and the x,y,z coordinates of the front upper left and back lower right vertices of the bounding cube. Which such a database, the algorithm for bounding sign is as follows: Algorithm 3 Algorithm for bounding signs. Require: d - database of spaces of interest 1: read in d 2: get array of entities e from GetNearbyEntities 3: for each entity in e do 4: check if the model of the entity matches any hash codes in d 5: get all the matching spaces of interest 6: for each space of interest do 7: draw a bounding box 3.6 Pixel Maps Pixel maps are more refined bounding boxes. Instead of marking an entity with four pixels, we mark it with every pixel it occupies on the screen. This can be done easily when we start with a bounding box b = {(xmin, ymin), (xmax, ymax)} and invert the function which maps 3 dimensional point to the screen. The inverse of g can be constructed as follows. Given a screenX and screenY in pixels, we transform the pixel values to coordinates on the viewing plane. Next, we transform the point on the viewing plane into a point in the 3 dimensional world, pworld. viewPlaneX = screenX UI.WIDTH ∗ viewWindowWidth viewPlaneZ = −screenY UI.HEIGHT ∗ viewWindowHeight pworld = viewPlaneX ∗ vx,camera + viewPlaneZ ∗ vup,camera + newOrigin Once we compute pworld, we use Raycast function to get the entity which occupies that pixel. They Raycast function requires a point of origin, in our case e, a direction, in our case pworld −e and a maximum distance the ray should travel, which we could set to be a very large number like 10,000. If the entity returned by Raycast matches the entity the bounding box encloses, then we added the pixel to the map. 25
  • 26. Algorithm 4 Algorithm for computing a pixel map of an entity. Require: entity, b = {(xmin, ymin), (xmax, ymax)} 1: let map be a boolean array IMAGE WIDTH by IMAGE HEIGHT 2: for x ∈ {xi|xi ∈ Z, xmin ≤ xi ≤ xmax} do 3: for y ∈ {yi|yi ∈ Z, ymin ≤ yi ≤ ymax} do 4: compute pworld of x, y 5: Raycast from e in direction of pworld − e to get entityRaycast 6: if entity = entityRaycast then 7: set map[x, y] to true Depending on the application, these maps can be combined together using the OR boolean function. The function for pixel maps is yet to be implemented due to time constraints. Besides being a trivial extension of bounding boxes, it is also less useful for machine learning due to a cumbersome and perhaps unnecessarily complex representation of objects. Figures 13 and 14 show what the result of such a function would look like. 3.7 Road Lanes Identifying and locating cars, pedestrians, and signs will only help with a part of the driving task. Even without any of these things present, drivers must still stay within a specified lane. Ultimately, locating the lanes and the vehicle’s position in them is the foundation of the driving task. We will explore a method for extracting information similar to [8] from GTA 5. 3.7.1 Notes on Drivers First, let’s examine how real drivers collect information on lane positions. There is ample literature on the topic. The general consensus is that humans look ahead about 1 second to locate lanes. [10] [20] [19] This time applies for speeds between 30 kmh and 60 kmh [10] [19] and corresponds to a distance of about 10 meters. In a more detailed model, human drivers have 2 distances at which they collect information. At 0.93 s or 15.7 m road curvature information is collected [19] and at 0.53 or 9 m position in lane is collected [19]. Near information is used to fine tune driving and is sufficient at low speeds [19]. At high speeds, the further level is used for guidance and stabilization [10]. Divers also look about 5.5 degrees below the true horizon for road data. [19] For curves, humans use a tangent point on the inside of the curve for guidance [20]. They locate this point 1 to 2 seconds before entering the curve. [20] 26
  • 27. Figure 13: Image with a bounding box. 27
  • 28. Figure 14: Image with a pixel map for a car applied. 28
  • 29. 3.7.2 Indicators From literature on human cognition, we know where people look for information on road lanes. In [8], we find a very useful model on what information to collect. Chenyi et al. system uses 13 indicators for navigating down a highway like racetrack. While this roadway is very simple compared to real world road which have exits, entrances, shared left turn lanes, and lane merges, the indicators are quite universal. Figure 15 lists the indicators, their descriptions, and ranges. 3.7.3 Road Network in GTA 5 The GTA 5 road network is composed of 74,530 nodes and 77,934 links. [2] For each node there are x, y, z coordinates and 19 flags and each link consists of 2 node ids and 4 flags. [2] This information is contained in paths.ipl. Figures 16 and 17 show which flags are currently known. It does not appear that any of these flags would be particularly useful to figuring out the location of the lane markings. The Federal Highway Administration sets lane width for freeway lane at 3.6 m (12 feet) and for local roads between 2.7 m and 3.6 m. Ramps are between 3.6 and 9 m (12 to 30 feet). [1]. Based on measurements, the lanes in GTA 5 are 5.6 meters wide. This should not be a problem when the network is applied to real world applications since the output can always be scaled. 3.7.4 Finding the Lanes We know what information we would like to collect and we know that we want to collect it at a point in the road about 10 meters in front of the test car. Figure 18 represents our data collection situation. We want to compute where the lanes are at blue line. Assuming we could locate the markings for the left, right and middle lanes, we could then see if there are any cars whose positions fall between these points. The cars would also have to be visible on the screen and no further then some maximum distance. Following [8], this distance d could be 70 meters. We can compute the indicators if we know the position of the lanes and the heading of the road. Let h be the heading vector of the road at the 10 meter mark. Let LL, ML, MR, and RR be points on the lane markings where the blue line intersects the lanes. Let f be a point on the ground at the very front of the test vehicle, possibly below the camera. We will perform the calculation for the three lane indicators are the two lane indicator can be filled in with values set based on these indicators. The angle is simply the angle between the test car heading vector, hcar and the road heading vector. 29
Indicators

Indicator       Description                                                Min Value   Max Value
angle           angle between the car's heading and the tangent of the road    -0.5        0.5
dist_L          distance to the preceding car in the left lane                    0         75
dist_R          distance to the preceding car in the right lane                   0         75
toMarking_L     distance to the left lane marking                                -7       -2.5
toMarking_M     distance to the central lane marking                             -2        3.5
toMarking_R     distance to the right lane marking                              2.5          7
dist_LL         distance to the preceding car in the left lane                    0         75
dist_MM         distance to the preceding car in the current lane                 0         75
dist_RR         distance to the preceding car in the right lane                   0         75
toMarking_LL    distance to the left lane marking of the left lane             -9.5         -4
toMarking_ML    distance to the left lane marking of the current lane          -5.5       -0.5
toMarking_MR    distance to the right lane marking of the current lane          0.5        5.5
toMarking_RR    distance to the right lane marking of the right lane              4        9.5

Figure 15: List of indicators, their descriptions and ranges. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
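In a script, the indicators of Figure 15 can be grouped into a single value type so that they can be written out as one record during data collection. The snippet below is only an organizational sketch; the struct and field names are ours and simply mirror the figure.

// Sketch of a container for the 13 indicators of Figure 15.
// Distances in meters, angle in radians, matching the ranges in [8].
struct RoadIndicators
{
    public double Angle;                               // angle
    public double DistL, DistR;                        // dist_L, dist_R
    public double ToMarkingL, ToMarkingM, ToMarkingR;  // toMarking_L/M/R
    public double DistLL, DistMM, DistRR;              // dist_LL, dist_MM, dist_RR
    public double ToMarkingLL, ToMarkingML,
                  ToMarkingMR, ToMarkingRR;            // toMarking_LL/ML/MR/RR
}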
Flag   Meaning
0      0 (primary) or 1 (secondary or tertiary)
1      0 (land), 1 (water)
2      unknown (0 for all nodes)
3      unknown (1 for 65,802 nodes, otherwise 0, 2, or 3)
4      0 (road), 2 (unknown), 10 (pedestrian), 14 (interior), 15 (stop), 16 (stop), 17 (stop), 18 (pedestrian), 19 (restricted)
5      unknown (from 0/15 to 15/15)
6      unknown (0 for 60,111 nodes, 1,141 other values)
7      0 (road) or 1 (highway or interior)
8      0 (primary or secondary) or 1 (tertiary)
9      0 (most nodes) or 1 (some tunnels)
10     unknown (0 for all nodes)
11     0 (default) or 1 (stop - turn right)
12     0 (default) or 1 (stop - go straight)
13     0 (major) or 1 (minor)
14     0 (default) or 1 (stop - turn left)
15     unknown (1 for 10,455 nodes, otherwise 0)
16     unknown (1 for 32 nodes, otherwise 0, on highways)
17     unknown (1 for 62 nodes, otherwise 0, on highways)
18     unknown (1 for 92 nodes, otherwise 0, some turn lanes)

Figure 16: Flags for nodes. [2]

Flag   Meaning
0      unknown (-10, -1 to 8 or 10)
1      unknown (0 to 4 or 6)
2      0 (one-way), 1 (unknown), 2 (unknown), 3 (unknown)
3      0 (unknown), 1 (unknown), 2 (unknown), 3 (unknown), 4 (unknown), 5 (unknown), 8 (lane change), 9 (lane change), 10 (street change), 17 (street change), 18 (unknown), 19 (street change)

Figure 17: Flags for links. [2]

angle = cos^-1( (h · h_car) / (||h|| ||h_car||) )

For toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR, we will assume that the lanes are straight lines. We have a point on each of those lines and a vector indicating the direction in which they are heading. This assumption is crude; however, at the distances we are discussing it should not produce large errors. Additionally, we could adjust the distance at which we sample data based on the road heading. This would not only be more in line with human behavior [10] [20] [19], it would also reduce errors.
Figure 18: Blue line represents where we want to collect data on lane location.
To compute the distance we must project the vector f − LL onto the vector −h and compute the distance between the projected point and f − LL. We will work out the mathematics for the left marking of the left lane, LL.

r = proj_{−h}(f − LL) = ( ((f − LL) · (−h)) / ||−h||^2 ) (−h)

toMarking_LL = ||(f − LL) − r||

To compute dist_LL, dist_MM, and dist_RR, we must first figure out which vehicles are in which lanes. Of all the vehicles returned by GetNearbyVehicles, we can eliminate any whose heading vector forms an angle of more than 90 degrees with the heading of the road. The position of the vehicle, p, must be within a rectangular prism formed by LL, RR, f, and f + d·h in the direction normal to the ground, which is also the world up vector for the test car, v_up. This can be computed by projecting these points onto the plane through f with normal v_up. The following are the projections of the points.

r_LL = LL − proj_{v_up}(LL) = LL − ( (LL · v_up) / ||v_up||^2 ) v_up

r_RR = RR − proj_{v_up}(RR) = RR − ( (RR · v_up) / ||v_up||^2 ) v_up

r_{f+d·h} = (f + d·h) − proj_{v_up}(f + d·h) = (f + d·h) − ( ((f + d·h) · v_up) / ||v_up||^2 ) v_up

r_p = p − proj_{v_up}(p) = p − ( (p · v_up) / ||v_up||^2 ) v_up

Now we just have to check that the y coordinate of r_p is between the y coordinates of r_LL and r_RR, and that the x coordinate of r_p is between 0 and the x coordinate of r_{f+d·h}. If the vehicle satisfies these bounds, we can compute its distance to all lane markings in the same way we did for the test vehicle. We then check which marking it is closest to and assign it to that lane, or perform additional logic. Let us assume it is in the left lane. We perform the following to compute dist_LL.

r = proj_{h_car}(p − f) = ( ((p − f) · h_car) / ||h_car||^2 ) h_car

dist_LL = ||r||
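These projections translate directly into a few small helpers over the Vector3 type available to scripts. The sketch below assumes that h, h_car, f, and the marking points have already been obtained; Project, HeadingAngle, and ToMarking are our own helper names, and the Vector3 operations used (Dot, Length, scalar multiplication) are assumed to behave as in the library's math types.

// Sketch of the projection-based computations of this section.
// Assumes: using System; using GTA.Math;
static Vector3 Project(Vector3 a, Vector3 onto)
{
    // proj_onto(a) = ((a · onto) / ||onto||^2) onto
    return onto * (Vector3.Dot(a, onto) / Vector3.Dot(onto, onto));
}

static double HeadingAngle(Vector3 h, Vector3 hCar)
{
    // angle = cos^-1( (h · h_car) / (||h|| ||h_car||) )
    return Math.Acos(Vector3.Dot(h, hCar) / (h.Length() * hCar.Length()));
}

static double ToMarking(Vector3 f, Vector3 marking, Vector3 h)
{
    // r = proj_{-h}(f - marking), toMarking = ||(f - marking) - r||
    Vector3 d = f - marking;
    Vector3 r = Project(d, -h);
    return (d - r).Length();
}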
Algorithm 5 Algorithm for computing dist_LL, dist_MM, and dist_RR.
Require: f, h, h_car, LL, ML, MR, RR, d
1: create arrays dist_LLs, dist_MMs, and dist_RRs and add d to each
2: let l be the lane of the vehicle
3: for each vehicle v (with position p and heading h_v) returned by GetNearbyVehicles do
4:   if cos^-1( (h · h_v) / (||h|| ||h_v||) ) < π/2 then
5:     if p is within the three lanes, in front of the test car, and no farther than d then
6:       compute toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR for p
7:       if toMarking_LL is smallest then
8:         l = left lane
9:       else if toMarking_RR is smallest then
10:        l = right lane
11:      else if toMarking_ML is smallest then
12:        if toMarking_LL < toMarking_MR then l = left lane else l = middle lane
13:      else if toMarking_MR is smallest then
14:        if toMarking_RR < toMarking_ML then l = right lane else l = middle lane
15:      if l = right lane then
16:        add ||proj_{h_car}(p − f)|| to dist_RRs
17:      else if l = left lane then
18:        add ||proj_{h_car}(p − f)|| to dist_LLs
19:      else
20:        add ||proj_{h_car}(p − f)|| to dist_MMs
21: dist_RR = min dist_RRs
22: dist_LL = min dist_LLs
23: dist_MM = min dist_MMs
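A script-level sketch of Algorithm 5 follows. It is illustrative only: the lane-corridor test and the marking comparisons are hidden behind the hypothetical helpers InsideLaneCorridor and AssignLane, the Lane enum is our own, and World.GetNearbyVehicles and Entity.ForwardVector are assumed to behave as described for the scripting library.

// Sketch of Algorithm 5. Assumes: using System; using GTA; using GTA.Math;
// h, hCar, f, d, and the marking points are taken as given; Project() is the
// helper sketched earlier. Hypothetical helpers: InsideLaneCorridor() implements
// the prism test above, AssignLane() the toMarking comparisons, and
// enum Lane { Left, Middle, Right } labels the result.
double distLL = d, distMM = d, distRR = d;

foreach (Vehicle v in World.GetNearbyVehicles(testCar.Position, (float)d))
{
    // discard vehicles heading more than 90 degrees away from the road heading
    if (Vector3.Dot(h, v.ForwardVector) <= 0f)
        continue;

    Vector3 p = v.Position;
    if (!InsideLaneCorridor(p))
        continue;

    double dist = Project(p - f, hCar).Length();
    Lane lane = AssignLane(p);
    if (lane == Lane.Left) distLL = Math.Min(distLL, dist);
    else if (lane == Lane.Right) distRR = Math.Min(distRR, dist);
    else distMM = Math.Min(distMM, dist);
}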
To perform the above computation we need a vector representing the heading of the road and a point on each lane marking. This is where the challenge begins. We cannot use any of the functions or methods discussed for objects, because roads and lane markings are not entities. The road is part of the terrain and the lanes are a texture. Therefore, we cannot get the width of the road model or the position of a lane marking the way we obtained those properties for cars.

GTA 5 has realistic traffic. There are many AI-driven cars in the game which navigate the road network while staying in their lanes. Therefore, the game engine knows the location of the lane markings. There are several functions which pertain to roads. GetStreetName returns the name of the street at a specified point in the world. IS_POINT_ON_ROAD is a native function which checks whether a point is on a road. There are also several functions which deal with vehicle nodes.

Vehicle nodes appear to be the primary way the graph of the road network is represented in the game. Every vehicle node is a point at the center of the road, as seen in Figure 19. The nodes are spaced in proportion to the curvature of the road: close together at sharp corners and farther apart on straight stretches of road. Each node has a unique id. The main functions for working with nodes are GET_NTH_CLOSEST_VEHICLE_NODE and GET_NTH_CLOSEST_VEHICLE_NODE_ID. A way to call them in a script is shown in the code snippet below. In this snippet, the "safe" arguments serve an unknown purpose, as do the two zeros in GET_NTH_CLOSEST_VEHICLE_NODE_ID. The i variable specifies which node, in order of proximity, should be selected. There is also a function GET_VEHICLE_NODE_PROPERTIES; however, we could not find a way to get this function to work.

OutputArgument safe1 = new OutputArgument();
OutputArgument safe2 = new OutputArgument();
OutputArgument safe3 = new OutputArgument();
Vector3 midNode;
OutputArgument outPosArg = new OutputArgument();

Function.Call(Hash.GET_NTH_CLOSEST_VEHICLE_NODE, playerPos.X, playerPos.Y,
    playerPos.Z, i, outPosArg, safe1, safe2, safe3);
midNode = outPosArg.GetResult<Vector3>();

int nodeId = Function.Call<int>(Hash.GET_NTH_CLOSEST_VEHICLE_NODE_ID, playerPos.X,
    playerPos.Y, playerPos.Z, i, safe1, 0f, 0f);
Figure 19: Red markers represent locations of vehicle nodes.
The benefit of this system is that we can locate our car on the network by getting the closest node. Given the road heading and lane width, it is possible to compute the centers of the lanes, as seen in Figure 20. The problem is that, as far as we could find, there is no way of getting the heading of the road or the number and positions of the lanes around the node.

Figure 20: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.

A promising approach to solving this problem was road model fitting. We know that the node is at the center of the road. We do not know whether it is on a lane marking or in the middle of a lane. We could assume that it is on a lane marking and then count the number of lanes on the left and right. This could be done by moving over one lane width at a time and checking whether the point is still on the road using IS_POINT_ON_ROAD and GetStreetName. We can repeat the same method under the assumption that the node is in the middle of a lane.
Whichever assumption finds more lanes is the correct one, since the wrong assumption will not count the outermost lanes. This still leaves the question of finding the heading of the road and whether the node lies between lanes going in opposite directions. However, there are two fundamental problems with this approach which make it useless. First, the approach assumes that the nodes are at the centers of lanes or on lane markings. Upon further exploration, we found that nodes can be on medians, as in Figure 21. This is still the center of the road, just not where we expect it. Second, IS_POINT_ON_ROAD is not a reliable indicator of whether a point is actually on a road. Sometimes it returns false for points which are clearly on the road, and sometimes it returns true for points which are on the side of the road.

Figure 21: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.
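For completeness, the lane-counting probe described above would look roughly like the following in a script. It is shown only to make the idea concrete; as just discussed, IS_POINT_ON_ROAD proved unreliable, and the rightOfRoad direction would itself require the unknown road heading. The 5.6 m lane width is the value measured in Section 3.7.3, and CountLanes is our own hypothetical helper.

// Sketch of the (ultimately rejected) lane-counting probe.
// Assumes: using GTA; using GTA.Math; using GTA.Native;
// rightOfRoad is a unit vector perpendicular to the road heading, which is itself unknown.
const float LANE_WIDTH = 5.6f;

int CountLanes(Vector3 node, Vector3 rightOfRoad)
{
    int lanes = 0;
    // step outward one lane width at a time until the probe leaves the road
    for (int i = 1; i <= 8; i++)
    {
        Vector3 probe = node + rightOfRoad * (i * LANE_WIDTH);
        bool onRoad = Function.Call<bool>(Hash.IS_POINT_ON_ROAD,
                                          probe.X, probe.Y, probe.Z, 0);
        if (!onRoad)
            break;
        lanes++;
    }
    return lanes;
}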
There are two solutions to this problem. The first is to keep hacking at the game until we find all of this information. The information we are looking for must be somewhere in the game, because the game AI knows where to drive: it knows where the lanes are and how to stay in them. The second solution is to build a database of nodes. Figure 22 lists the data which would be stored in this database.

Field        Meaning
nodeId       The numerical id of the node.
onMarking    True if the node is on a lane marking, false if it is in the middle of a lane.
oneWay       True if the traffic on both sides of the node moves in the same direction.
leftStart    Vector representing the point where the road begins left of the node.
leftEnd      Vector representing the point where the road ends left of the node.
rightStart   Vector representing the point where the road begins right of the node.
rightEnd     Vector representing the point where the road ends right of the node.
heading      Vector representing the heading of the road.

Figure 22: Node database entry design.

The problem with this method is that there are over 70,000 nodes and there does not appear to be an easy way of collecting this information. At the moment, there does not appear to be a simpler solution.
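A node record matching Figure 22 could be declared as below and serialized to a file as the database is built. The type and field names are ours, chosen only to mirror the figure.

// Sketch of a node database entry mirroring Figure 22.
// Assumes: using GTA.Math;
class NodeRecord
{
    public int NodeId;          // numerical id of the node
    public bool OnMarking;      // true if on a lane marking, false if mid-lane
    public bool OneWay;         // true if both sides carry traffic in the same direction
    public Vector3 LeftStart;   // where the road begins left of the node
    public Vector3 LeftEnd;     // where the road ends left of the node
    public Vector3 RightStart;  // where the road begins right of the node
    public Vector3 RightEnd;    // where the road ends right of the node
    public Vector3 Heading;     // heading of the road at the node
}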
4 Towards The Ultimate AI Machine

The previous section outlined methods for getting information out of GTA 5 to create datasets. To fully utilize GTA 5, we still need to create a database of nodes and spaces of interest. Once that is done, we will move on to creating datasets and training neural networks.

The objective of harvesting this data has been emphasized as training data for neural networks. However, the ultimate goal is much grander: building a system which can master driving in GTA 5. This system would probably include several neural networks and perhaps other statistical models. For example, it may include a network for locating pedestrians, an SVM for classifying street signs, another network for recognizing traffic lights, etc. All of these components would be linked together by some master program that would construct the most likely world model based on all of these "sensors". Another program would then be responsible for driving the car. Since we can extract data from GTA 5 in real time, we can test how well this system would work in changing conditions.

In the process of building such a system, it is possible to test out some new ideas in neural networks. We would like to continue to explore curriculum learning [5] and self-paced learning [18] [16] as means of presenting examples in order of difficulty. Since these ideas have been applied to object tracking in video [28], teaching robots motor skills [17], matrix factorization [31], handwriting recognition [22], and multi-task learning [26], surpassing state-of-the-art benchmarks, we hope that they could be used to improve autonomous driving. Another interesting idea is transfer learning [25], or the ability to use a network trained in one domain in another domain. This could be applied to pedestrian and sign classifiers. Lastly, we have been working on ways to use optimal learning to select the best neural network architectures. It would be interesting to try those methods in this application.

Building this system presents two major difficulties. First, both the game and the neural networks are GPU-intensive processes. Running both on a single machine would require a lot of computational power. Second, GTA 5 will only work on Windows PCs, while most deep learning libraries are Linux based. Porting either application is close to infeasible. Last semester, working with Daniel Stanley and Bill Zhang, we constructed a solution for running GTA 5 with TorcsNet from [8]. The idea was to run the processes on separate machines and have them communicate via a shared folder on a local network; see Figure 23. During the tests, the amount of data transferred was small: a text file of 13 floats and a 280 by 210 png image. This setup is fast enough for the system to run at around 10 Hz.
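The shared-folder exchange can be sketched as follows from the GTA 5 side. Only the payload (a 13-float text file and a 280 by 210 image) comes from the experiment above; the file names, the share path, and the polling interval are illustrative assumptions, and ExchangeOnce is a hypothetical helper.

// Sketch of the GTA 5 side of the shared-folder exchange (names are assumptions).
// Assumes: using System; using System.Globalization; using System.IO; using System.Threading;
const string SharedFolder = @"\\LINUX-BOX\gta_share";   // hypothetical network share

void ExchangeOnce()
{
    // 1. write the current frame for the machine running the network
    string imagePath = Path.Combine(SharedFolder, "frame.bmp");
    screenshot(imagePath);   // function from Appendix A (saves BMP; adjust the format if PNG is required)

    // 2. wait for the 13 indicators written back over the share
    string indicatorPath = Path.Combine(SharedFolder, "indicators.txt");
    while (!File.Exists(indicatorPath))
        Thread.Sleep(10);    // a ~10 Hz budget leaves roughly 100 ms per frame

    string[] tokens = File.ReadAllText(indicatorPath).Split(
        new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    double[] indicators = Array.ConvertAll(tokens,
        t => double.Parse(t, CultureInfo.InvariantCulture));

    // 3. hand the indicators to a controller and clean up for the next cycle
    File.Delete(indicatorPath);
}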
Figure 23: GTA V Experimental Setup

4.1 Future Research Goals

Build a database of GTA V road nodes
Build a database of GTA V road signs
Train sign classifier
Train traffic lights classifier
Compare how well GTA V trained classifier works on real datasets
Check how well the TORCS network can identify cars in GTA V
Build a robust controller in GTA V which uses all 13 indicators
Explore the effects of curriculum learning on driving performance
Explore transfer learning and optimal learning for neural networks
Test trained models in a real vehicle (PAVE)
A Screenshot Function

// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
//           using System.Runtime.InteropServices;

private struct Rect
{
    public int Left;
    public int Top;
    public int Right;
    public int Bottom;
}

[DllImport("user32.dll")]
private static extern IntPtr GetForegroundWindow();

[DllImport("user32.dll")]
private static extern IntPtr GetClientRect(IntPtr hWnd, ref Rect rect);

[DllImport("user32.dll")]
private static extern IntPtr ClientToScreen(IntPtr hWnd, ref Point point);

void screenshot(String filename)
{
    //UI.Notify("Taking screenshot?");
    var foregroundWindowsHandle = GetForegroundWindow();
    var rect = new Rect();
    GetClientRect(foregroundWindowsHandle, ref rect);

    var pTL = new Point();
    var pBR = new Point();
    pTL.X = rect.Left;
    pTL.Y = rect.Top;
    pBR.X = rect.Right;
    pBR.Y = rect.Bottom;
    ClientToScreen(foregroundWindowsHandle, ref pTL);
    ClientToScreen(foregroundWindowsHandle, ref pBR);

    Rectangle bounds = new Rectangle(pTL.X, pTL.Y,
        rect.Right - rect.Left, rect.Bottom - rect.Top);

    using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
    {
        using (Graphics g = Graphics.FromImage(bitmap))
        {
            g.ScaleTransform(.2f, .2f);
            g.CopyFromScreen(new Point(bounds.Left, bounds.Top),
                Point.Empty, bounds.Size);
        }

        Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
        using (Graphics g = Graphics.FromImage(output))
        {
            g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
        }
        output.Save(filename, ImageFormat.Bmp);
    }
}
References

[1] Lane width. http://safety.fhwa.dot.gov/geometric/pubs/mitigationstrategies/chapter3/3_lanewidth.cfm. Accessed: 2016-4-29.
[2] Paths (GTA V). http://gta.wikia.com/wiki/Paths_(GTA_V). Accessed: 2016-4-29.
[3] Federal Highway Administration. Manual on Uniform Traffic Control Devices. 2009.
[4] A. R. Atreya, B. C. Cattle, B. M. Collins, B. Essenburg, G. H. Franken, A. M. Saxe, S. N. Schiffres, and A. L. Kornhauser. Prospect Eleven: Princeton University's entry in the 2005 DARPA Grand Challenge. Journal of Field Robotics, 23(9):745–753, 2006.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[6] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
[7] S. M. Bileschi. StreetScenes: Towards Scene Understanding in Still Images. PhD thesis, Citeseer, 2006.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
[9] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311. IEEE, 2009.
[10] E. Donges. A two-level model of driver steering behavior. Human Factors: The Journal of the Human Factors and Ergonomics Society, 20(6):691–707, 1978.
[11] F. Flohr, D. M. Gavrila, et al. PedCut: an iterative framework for pedestrian segmentation combining shape models and multiple data cues. 2013.
[12] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[14] E. Guizzo. How Google's self-driving car works. IEEE Spectrum Online, October 18, 2011.
[15] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.
[16] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[17] A. Karpathy and M. Van De Panne. Curriculum learning for motor skills. In Advances in Artificial Intelligence, pages 325–330. Springer, 2012.
[18] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1189–1197. Curran Associates, Inc., 2010.
[19] M. Land, J. Horwood, et al. Which parts of the road guide steering? Nature, 377(6547):339–340, 1995.
[20] M. F. Land and D. N. Lee. Where we look when we steer. Nature, 1994.
[21] F. Larsson and M. Felsberg. Using Fourier descriptors and spatial models for traffic sign recognition. In Image Analysis, pages 238–249. Springer, 2011.
[22] J. Louradour and C. Kermorvant. Curriculum learning for handwritten text line recognition. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 56–60. IEEE, 2014.
[23] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition: how far are we from the solution? In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–8. IEEE, 2013.
[24] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. Intelligent Transportation Systems, IEEE Transactions on, 13(4):1484–1497, 2012.
[25] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359, 2010.
[26] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. arXiv preprint arXiv:1412.1353, 2014.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[28] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2379–2386. IEEE, 2013.
[29] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661–692, 2006.
[30] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier. Evaluation of road marking feature extraction. In Proceedings of 11th IEEE Conference on Intelligent Transportation Systems (ITSC'08), pages 174–181, Beijing, China, 2008. http://perso.lcpc.fr/tarel.jean-philippe/publis/itsc08.html.
[31] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.