Driving School II
Video Games for Autonomous Driving
Independent Work
Artur Filipowicz
ORFE Class of 2017
Advisor Professor Alain Kornhauser
arturf@princeton.edu
May 3, 2016
Revised
August 27, 2016
Abstract
We present a method for generating datasets to train neural networks and other statistical
models to drive vehicles. In [8], Chen et al. used a racing simulator called Torcs to
generate a dataset of driving scenes which they then used to train a neural network. One
limitation of Torcs is a lack of realism. The graphics are plain and the only roadways
are racetracks, which means there are no intersections, pedestrian crossings, etc. In this
paper we employ a game called Grand Theft Auto 5 (GTA 5). This game features realistic
graphics and a complex transportation system of roads, highways, ramps, intersections,
traffic, pedestrians, railroad crossings, and tunnels. Unlike Torcs, GTA 5 has more car
models, urban, suburban, and rural environments, and control over weather and time.
With the control of time and weather, GTA 5 has an edge over conventional methods of
collecting datasets as well.
We present methods for extracting three particular features. We create a function for
generating bounding boxes around cars, pedestrians and traffic signs. We also present
a method for generating pixel maps for objects in GTA 5. Lastly, we develop a way to
compute distances to lane markings and other indicators from [8].
Acknowledgments
I would like to thank Professor Alain L. Kornhauser for his
mentorship during this project and Daniel Stanley and Bill
Zhang for their help over the summer and last semester.
This paper represents my own work in accordance with University
regulations.
Artur Filipowicz
Contents

1 From The Driving Task to Machine Learning
1.1 The Driving Task
1.2 The World Model
1.3 Computer Vision
1.4 Machine Learning
2 Datasets for the Driving Task
2.1 Cars, Pedestrians, and Cyclists
2.2 Lanes
2.3 Observations on Current Datasets
2.4 Video Games and Datasets
3 Sampling from GTA 5
3.1 GTA 5 Scripts Development
3.2 Test Car
3.3 Desired Functions
3.4 Screenshots
3.5 Bounding Boxes
3.5.1 GTA 5 Camera Model
3.5.2 From 3D to 2D
3.5.3 General Approach To Annotation of Objects
3.5.4 Cars
3.5.5 Pedestrians
3.5.6 Signs
3.6 Pixel Maps
3.7 Road Lanes
3.7.1 Notes on Drivers
3.7.2 Indicators
3.7.3 Road Network in GTA 5
3.7.4 Finding the Lanes
4 Towards The Ultimate AI Machine
4.1 Future Research Goals
A Screenshot Function

List of Figures

1 Graphics and roads in Torcs.
2 Graphics and roads in GTA 5.
3 Test Vehicle
4 The red dot represents camera location.
5 Camera model and parameters in GTA 5
6 Two cars bounded in boxes. Weather: rain.
7 Two cars bounded in boxes.
8 Traffic jam bounded in boxes.
9 Pedestrians bounded in boxes.
10 Some of the traffic signs present in GTA 5.
11 Stop sign in bounding box.
12 Traffic lights in bounding boxes.
13 Image with a bounding box.
14 Image with a pixel map for a car applied.
15 List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
16 Flags for links. [2]
17 Flags for nodes. [2]
18 Blue line represents where we want to collect data on lane location.
19 Red markers represent locations of vehicle nodes.
20 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.
21 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.
22 Node database entry design.
23 GTA V Experimental Setup
1 From The Driving Task to Machine Learning
1.1 The Driving Task
The driving task is a physics problem of moving an object from point a ∈ R⁴ to point b ∈ R⁴, with time being the fourth dimension, without colliding with any other object. There
are also additional constraints in the form of lane markings, speed limits, and traffic flow
directions. Even with all constraints beyond avoiding collisions, the physical problem of
finding a navigable path is easy given a model of the world. That is, if the location of all
objects and their shapes is known with certainty and the location of the constraints is
known, then the task becomes first the computation of a path in a digraph G representing
the road network and then for each edge finding unoccupied space and moving the object
into it. All of these problems can be solved using fundamental physics and computer
science. What makes the driving task difficult in the real world setting is the lack of an
accurate world model. In reality we do not have omniscient drivers.
1.2 The World Model
People drive, and so do computers to a limited extent. Therefore, omniscience is not
necessary. Some subset of the total world model is good enough to perform the driving
task. Perhaps with limited knowledge it is only possible to successfully complete the task with a probability less than 1, but the success rate is high enough for people to rely on this form of transport.
To drive, we still need a world model. This model is constructed by means of sensor fusion, the combination of information from several different sensors. In 2005, Princeton
University’s entry in the DARPA Challenge, Prospect 11, used radar and cameras to
identify and locate obstacles. Based on these measurements and GPS data, the on-board
computer would create a world model and find a safe path. [4] In a similar approach, the
Google Car uses radar and lidar to map the world around it. [14]
Approaches in [4], [14], and [29] appear rather cumbersome and convoluted compared to the human way of creating a world model. Humans have five sensory organs: the eyes, the nose, the ears, the mouth, and the skin. In driving, neither taste nor smell nor touch is used to build the world model, as these senses are mostly cut off from the world outside the vehicle. The driver can hear noises from the outside; however, they can be muffled by the sound of the driver's own vehicle, and many important objects, such as street signs and lane markings, do not make noise. To construct the world model, humans predominantly use one sensor, the eyes. We can suspect that there is enough information encoded in the visible light coming through the front windshield to build a world model good enough for completing the driving task. However, research on autonomous vehicles - the construction of a solution to the driving task using artificial intelligence - stays away from approaching the problem in a purely visual way, as noted in [4] and [29]. The reason for this is that vision, computer vision in particular, is difficult.
1.3 Computer Vision
Let X ∈ R^(h×w×c) be an image of width w, height h, and c color channels. As we stated earlier, X has enough information for a human to figure out where lane markings and other vehicles are, identify and classify road signs, and perform other measurements to build a world model. Perhaps several images in a sequence are necessary, although [8] shows that a lot of information can be extracted from a single image. The difficulty of computer vision is that X is a matrix of numbers representing the colors of pixels. In this representation an object can appear very different depending on lighting conditions. Additionally, due to perspective, the same object can appear at different sizes and therefore occupy a different number of pixels. These are two of the many variations which humans can account for, but for which naive machine approaches fail.
Computer vision is difficult but not impossible. In recent decades, researchers have used machine learning to enable computers to take X and construct more salient representations.
1.4 Machine Learning
The learning task is as follows: given some image Xi, we wish to predict a vector of indicators Yi. Yi could contain distances to lane markings and vehicles, locations of street signs, etc., and can then be used to construct a world model. To that end, we want to train a function f such that Yi = f(Xi). We say that (Xi, Yi) ∼ PX,Y.
The machine learning approach to this problem mimics humans in more than just the focus on visual information. The primary method of learning from images is the use of neural networks, more specifically convolutional neural networks. These statistical models are inspired by the neurons which make up the nerves and brain areas responsible for vision.
The mathematical abstraction is represented as follows. Let f(xi, W) be a neural network with L hidden layers of sizes l1 to lL:

f(xi, W) = gL(W(L) ... g3(W(3) g2(W(2) g1(W(1) xi))) ...)
W = {W(1), W(2), ..., W(L)}
W(i) ∈ R^(l(i+1) × l(i))

where gi(x) is some activation function.
The process of training is the process of adjusting the values of the W(i). This first requires a loss function which expresses the error made by the network; a common loss function is L2. Let D be a dataset of n indicator Yi and image Xi pairs,

D = {(Xi, Yi)}, i = 1, ..., n,

and let R ⊂ D be the training set and T ⊂ D be the test set, with

R ∩ T = Ø, R ∪ T = D, |R| = r, |T| = t.

We wish to create a neural network model f that attains min_f L2(T, f).
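As a concrete example (the paper leaves the exact form of L2 open, so this is only one common choice), the L2 loss over the test set can be written as a mean squared error:

\[ L_2(\mathcal{T}, f) = \frac{1}{t} \sum_{(X_i, Y_i) \in \mathcal{T}} \left\lVert Y_i - f(X_i, W) \right\rVert_2^2 \]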
To minimize the loss function with respect to W, the most common method is the Back-Propagation Algorithm [27], which uses stochastic gradient descent to find a local minimum of a function. At each iteration j (of J total iterations), the Back-Propagation Algorithm updates W:

W_(j+1) = W_j − η ∂E(W)/∂W_j

The two sources of randomness in the algorithm are the initialization W0 and the order π in which training examples are used. The initial values of the elements of the matrices in W0 are uniform random variables. The ordering of the examples is often a random sample, with replacement, of J pairs (xi, yi) ∈ R.
On an intuitive level, the network adjusts W to extract useful features from the image pixel values Xi. In the process it models the distribution (Xi, Yi) ∼ PX,Y. In theory, the larger W is, the more capacity the network has for extracting and learning features and representing complex distributions. At the same time, it is also more likely to fit noise in the data and non-salient features such as clouds. This overfitting causes poor generalization, and we need a network which can generalize to many driving scenes. There are several regularization techniques to overcome overfitting, including L1, L2, dropout, and others. However, these will only be effective if the data adequately represent the domain of PX,Y. This domain for driving scenes is huge, considering it includes images of all the different kinds of roads, vehicles, pedestrians, street signs, traffic lights, intersections, ramps, lane markings, lighting conditions, weather conditions, times of day, and positions of the camera. [8] tested a network in a limited subset of these conditions and used almost half a million images for training.
2 Datasets for the Driving Task
Machine learning for autonomous vehicles has been studied for years. Therefore, several
datasets already exist. These datasets come in two types. There are datasets of objects
of interest in the driving scenes which include vehicles (cars, vans, trucks), cyclists,
pedestrians, traffic lights, lane markings and street signs. Usually, these datasets provide
coordinates of bounding boxes around the objects Yi. These are useful for training
localization and classification models. The second type of dataset provides distances to lane markings, cars, and pedestrians. These are used to train regression models. Here we
will give a brief overview of several of these datasets.
2.1 Cars, Pedestrians, and Cyclists
The Daimler Pedestrian Segmentation Benchmark Dataset contains 785 images of pedestrians in an urban environment captured by a calibrated stereo camera. The groundtruth consists of the true pixel shape and a disparity map. [11]
The CBCL StreetScenes Challenge Framework contains 3,547 images of driving scenes with bounding boxes for 5,799 cars, 1,449 pedestrians, and 209 cyclists, as well as buildings, roads, sidewalks, stores, trees, and the sky. The images were captured by photographers from street, crosswalk, and sidewalk views. [7]
KITTI Object Detection Evaluation 2012 contains 7481 training images and 7518 test
images with each image containing several objects. The total number of objects is 80,256,
including cars, pedestrians, and cyclists. The groundtruth includes a bounding box for
the object as well as an estimate of the orientation in the bird’s eye view. [13]
Caltech Pedestrian Detection Benchmark contains 10 hours of driving in an urban
environment. The groundtruth contains 350,000 bounding boxes for 2300 unique pedes-
trians. [9]
There are several datasets for street signs [23], [21], and [15]. However, these datasets
have been made in European countries and therefore they contain European signs which
are very different from their US counterparts. Luckily [24] is a dataset of 6,610 images
containing 47 different US road signs. For each sign the annotation includes sign type,
position, size, occluded (yes/no), and on side road (yes/no).
2.2 Lanes
KITTI Road/Lane Detection Evaluation 2013 has 289 training and 290 test images of road lanes, with groundtruth consisting of pixel maps of the road area and of the lane the vehicle is in. The dataset contains images from three environments: urban with unmarked lanes, urban with marked lanes, and urban with multiple marked lanes. [12]
The ROMA lane database has 116 images of different roads, with groundtruth pixel positions of visible lane markings. The camera calibration specifies the pixel distance to the true horizon and conversions between pixel distances and meters. [30]
2.3 Observations on Current Datasets
The above datasets are quite limited. First, most of them are small when compared to
the half a million images used in [8]. Second, they do not represent many of the driving
conditions, such as different weather conditions or times of day. The reason for this is that measuring equipment, especially cameras, functions well only in certain conditions; since this tends to mean sunny weather, most of these datasets are collected during such times. Additionally, all of these datasets involve some amount of manual labeling, which is not feasible when a dataset includes millions of images.
2.4 Video Games and Datasets
The problems associated with these datasets would be resolved if we could somehow sample both Xi and Yi from PX,Y without having to spend time measuring Yi. This is
not possible in the real world. However, [8] decided to use a virtual world, a racing video
game called Torcs [6]. The hope behind this approach is that the game can simulate
PX,Y well enough so that the network, once trained, will be able to generalize to the real
world. Let us assume that this is true.
The main benefit of using Torcs and other video games is access to the game engine.
This allows us to extract the true Yi for each Xi we harvest from the screen. Torcs itself
has several restrictions which limit it from simulating the range of driving conditions
present in the real world. Fundamentally it is a racing game with circular, one-way
tracks. The weather and lighting conditions are fixed. The textures are rather simple
and thus unrealistic.
To overcome these limitations and allow for a more diverse and realistic dataset, we
focus on the game Grand Theft Auto 5 (GTA 5). Unlike Torcs, the makers of GTA 5 had the funds to create a very realistic world, since they were developing a commercial product and not an open-source research tool. GTA 5 has hundreds of different vehicles, pedestrians, freeways, intersections, traffic signs, traffic lights, rich textures, and many other elements which create a realistic environment. Additionally, GTA 5 has about 14 weather conditions and simulates lighting conditions for all 24 hours of the day. To tap into
these features, the next section examines ways of extracting various data.
Figure 1: Graphics and roads in
Torcs.
Figure 2: Graphics and roads in
GTA 5.
3 Sampling from GTA 5
3.1 GTA 5 Scripts Development
GTA 5 is a closed source game. There is no out-of-the-box access to the underlying
game engine. However, due to the game’s popularity, fans have hacked into it and
developed a library of functions for interacting with the game engine. This is done through scripts loaded into the game. The objective of this paper is not to give a tutorial on coding scripts for GTA 5, and as such we will keep the discussion of code
to a minimum. However, we will explain some of the code and game dynamics for the
purpose of reproducibility and presentation of the methods used to extract data.
Two tools are needed to write scripts for GTA 5. The first tool is Script Hook V by Alexander Blade. This tool can be downloaded from https://www.gta5-mods.com/tools/script-hook-v or http://www.dev-c.com/gtav/scripthookv/. It comes with a useful trainer
which provides basic control over many game variables including weather and time. The
next tool is a library called Script Hook V .Net by Patrick Mours which allows us to use
C# and other .Net languages to write scripts for GTA 5. The library can be downloaded
from https://www.gta5-mods.com/tools/scripthookv-net. For full source code and list
of functions please see https://github.com/crosire/scripthookvdotnet.
3.2 Test Car
To make the data collection more realistic we will use an in-game vehicle, the test car,
with a mounted camera; similar to [13]. The vehicle model for the test car was picked
arbitrarily and can be replaced with any other model. Besides the steering controls, we introduce three new functions bound to the following keys: NumPad0, "I", and "O". NumPad0 spawns a new instance of our test car. "I" mounts the rendering camera on the test car.
Figure 3: Test Vehicle
"O" restores control of the rendering camera to the original state. Let us look at some of the code for the test car.
The TestVehicle() function is a constructor for the TestVehicle class. It is called once
when all of the scripts are loaded. This occurs at the start of the game and can be
triggered at any point in the game by hitting the ”insert” key. This constructor gains
control of the camera which is rendering the game by destroying all cameras and creating
a new rendering camera. The function responsible for this is World.CreateCamera. The
first two arguments represent position and rotation. The last argument is the field of
view in degrees. We set it to 50; however, this could be changed to match the parameters of a real-world camera.
It is important to note GTA.Native.Function.Call. GTA 5’s game engine has thousands
of native functions which were used by the developers to build the game. This library
encapsulates some of them. Others can be called using GTA.Native.Function.Call where
the first argument is the hash code of the function to call and the remaining arguments
are the arguments to pass to the native function. One of the biggest challenges in this
project is figuring out what these other arguments represent and control. There are
online databases where players of the game list known functions and parameters. These
databases are far from complete. Therefore, for some of these native function calls, some
of the arguments may not have any justification besides that they make the function
work. This is the price paid for using a closed source game.
public TestVehicle()
{
UI.Notify("Loaded TestVehicle.cs");
// create a new camera
World.DestroyAllCameras();
camera = World.CreateCamera(new Vector3(), new Vector3(), 50);
camera.IsActive = true;
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true,
camera.Handle, true, true);
// attach time methods
Tick += OnTick;
KeyUp += onKeyUp;
}
The camera position and rotation do not matter in the previous function as they will
be dynamically updated to keep up with the position and rotation of the car. This is accomplished by updating both properties on every tick of the game. A tick is a
periodic call of the OnTick function. On each tick, we will keep the camera following the
car by setting its rotation and position to be that of the test car. The position of the
camera is offset by 2 meters forward and 0.4 meters up relative to the center of the test
car. This places the camera on the center of the hood of the car as seen in Figure 4.
// Function used to keep camera on vehicle and facing forward on each tick step.
public void keepCameraOnVehicle()
{
if (Game.Player.Character.IsInVehicle())
{
// keep the camera in the same position relative to the car
camera.AttachTo(Game.Player.Character.CurrentVehicle,
new Vector3(0f, 2f, 0.4f));
// rotate the camera to face the same direction as the car
camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation;
}
}
Figure 4: The red dot represents camera location.
void OnTick(object sender, EventArgs e)
{
keepCameraOnVehicle();
}
3.3 Desired Functions
Being inside the game with our test vehicle, we want to collect training data. Existing
datasets provide good inspiration for what should be collected. A common datum is the
coordinates of bounding boxes for objects such as cars as in [7], [9] and [13] and traffic
signs as in [23], [21], [15] and [24]. Pixel maps marking the areas in the image where certain objects appear are also common. ROMA [30] has pixels of lane markings marked. KITTI Road/Lane Detection Evaluation 2013 [12] has pixels of road areas marked. The Daimler Pedestrian Segmentation Benchmark Dataset [11] has pixels of pedestrians marked. Lastly, we would like to make measurements of distances to lanes and cars in the framework from [8]. The
following sections describe ways of collecting the above information for X, Y data pairs.
3.4 Screenshots
To collect X, we take a screenshot of the game. GTA 5 runs only on Windows. Using the Windows user32.dll functions GetForegroundWindow, GetClientRect, and ClientToScreen, we can extract the exact area of the screen where the game appears. Since neural networks take small images as input, usually around 100 pixels by 200 pixels, we set the game resolution to be as small as possible and let h = IMAGE_HEIGHT = 600 pixels and w = IMAGE_WIDTH = 800 pixels. These could be further scaled down to fit a particular model such as [8]. For the implementation, please see Appendix A.
3.5 Bounding Boxes
A bounding box is a pair of points which defines a rectangle which encompasses an
object in an image. Let b = {(xmin, ymin), (xmax, ymax)} be a bounding box, where
xmin, ymin, xmax, ymax are coordinates in an image in pixels with the upper left corner
being the origin. The task of creating bounding boxes includes computing the extremes
of a 3 dimensional object and enclosing them in a rectangle. The algorithm for doing
this is very simple.
Algorithm 1 Algorithm for computing a bounding box.
Require: Model m and center c of an object
1: get the dimensions of m → (h, w, d)
2: compute unit vectors with respect to the object (ex, ey, ez)
3: using ex, ey, ez and h, w, d compute the set of vertices v of a cube enclosing the object
4: map each point p ∈ v to the viewing plane using g : R³ → R³ to create the set z
5: xmin = minimum x-coordinate over z
6: xmax = maximum x-coordinate over z
7: ymin = minimum y-coordinate over z
8: ymax = maximum y-coordinate over z
9: if xmin < 0 then xmin = 0
10: if xmax > IMAGE_WIDTH then xmax = IMAGE_WIDTH
11: if ymin < 0 then ymin = 0
12: if ymax > IMAGE_HEIGHT then ymax = IMAGE_HEIGHT
In GTA 5 it is very easy to compute ex, ey, ez and to get h, w, d for the models of cars, pedestrians, and traffic signs. Therefore, it is easy to create a bounding cube around an
object. The code excerpt below details the calculation. e is the object we wish to bound
and dim is a vector of the dimensions of the model h, w, d.
Vector3[] vertices = new Vector3[8];
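// FUL and BLR are the front upper left and back lower right vertices of the
// bounding cube; dim holds the model dimensions (h, w, d) and e is the entity
// being bounded.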
vertices[0] = FUL;
vertices[1] = FUL - dim.X*e.RightVector;
vertices[2] = FUL - dim.Z*e.UpVector;
vertices[3] = FUL - dim.Y*Vector3.Cross(e.UpVector, e.RightVector);
vertices[4] = BLR;
vertices[5] = BLR + dim.X*e.RightVector;
vertices[6] = BLR + dim.Z*e.UpVector;
vertices[7] = BLR + dim.Y*Vector3.Cross(e.UpVector, e.RightVector);
There is a function called WorldToScreen which takes a 3 dimensional point in the world and computes that point's location on the screen. Unfortunately, this function returns the origin if the point is not visible on the screen. This is a problem as we want
to draw a bounding box even if part of the object is out of view, a car coming in on
the left for example. In these cases we want the bounding box to extend to the edge
of the screen. The simplest solution is to map all points to the viewing plane which is
infinite and follow the algorithm above. This requires a custom g function and a good
understanding of the camera model.
3.5.1 GTA 5 Camera Model
Let us first establish some terminology. Let e ∈ R³ be the location of the observer and let c ∈ R³ be a point on the viewing plane, the plane where the image of the world is formed, such that the vector p from e to c represents the direction the camera is pointing and is perpendicular to the viewing plane. Additionally, let θ be the rotation vector of the camera relative to the world coordinates. After a lot of experimentation, we determined that the position property of the camera object in GTA 5 refers to e. θ measures angles counterclockwise in degrees. When θ = 0, the camera faces down the positive y-axis and the viewing plane is thus the xz-plane. The order of rotation from this position is around the x-axis, then the y-axis, and then the z-axis.
3.5.2 From 3D to 2D
Based on the information about the camera model, we can take a 3 dimensional point in the world, map it to the viewing plane, and then transform it to screen pixels. Let a ∈ R³ be the point we wish to map. First we must transform this point to camera coordinates. This is accomplished by rotating a using the equations below and subtracting c (the subtraction is omitted in the equations).
Figure 5: Camera model and parameters in GTA 5

(dx, dy, dz)ᵀ = Rz(θz) Ry(θy) Rx(θx) (ax, ay, az)ᵀ

where Rx, Ry, and Rz are the standard rotation matrices about the x, y, and z axes. Expanded,

dx = cos(θz)[ax cos(θy) + sin(θy)(ay sin(θx) + az cos(θx))] − sin(θz)[ay cos(θx) − az sin(θx)]
dy = sin(θz)[ax cos(θy) + sin(θy)(ay sin(θx) + az cos(θx))] + cos(θz)[ay cos(θx) − az sin(θx)]
dz = −ax sin(θy) + cos(θy)(ay sin(θx) + az cos(θx))
We also need to rotate the vector representing the z direction in the world, vup,world, and the vector representing the x direction in the world, vx,world. We also need to compute the width and height of the region of the viewing plane which is actually displayed on screen. We call this region the view window. In the equations below, F is the field of view in radians and d_nearclip is the distance between c and e.

viewWindowHeight = 2 · d_nearclip · tan(F/2)
viewWindowWidth = (IMAGE_WIDTH / IMAGE_HEIGHT) · viewWindowHeight
We then compute the intersection point between the vector d − e and the viewing plane, call it p_plane. We translate the origin to the upper left corner of the view window and update p_plane accordingly.

newOrigin = c + (viewWindowHeight/2) · vup,camera − (viewWindowWidth/2) · vx,camera
p_plane = (p_plane + c) − newOrigin
Next we calculate the coordinates of p_plane in the two dimensions of the plane.

viewPlaneX = (p_plane · vx,camera) / (vx,camera · vx,camera)
viewPlaneZ = (p_plane · vup,camera) / (vup,camera · vup,camera)
Finally we scale the coordinates to the size of the screen. UI.WIDTH and UI.HEIGHT are in-game constants.

screenX = (viewPlaneX / viewWindowWidth) · UI.WIDTH
screenY = (−viewPlaneZ / viewWindowHeight) · UI.HEIGHT
The process is summarized below.
Algorithm 2 get2Dfrom3D: Algorithm for computing screen coordinates of a 3D point.
Require: a
1: translate and rotate a into camera coordinates point d
2: rotate vup,world, vx,world to vup,camera, vx,camera
3: compute viewWindowHeight, viewWindowWidth
4: find intersection of d − e with the viewing plane
5: translate origin of the viewing plane
6: calculate the coordinates of the intersection point in the plane
7: scale the coordinates to screen size in pixels
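To make the procedure concrete, below is a condensed, geometry-only sketch of Algorithm 2 under the assumptions stated above: the camera position e, its unit right, forward and up vectors in world coordinates (obtained by rotating the world axes as described), the field of view, and the near clip distance are known. The class, the plain double[] vectors, and the helper names are illustrative and not part of the game's API, and points behind the camera would need additional handling before clamping. The second method sketches steps 5-12 of Algorithm 1 on the projected vertices.

using System;

static class ProjectionSketch
{
    const double ImageWidth = 800, ImageHeight = 600; // stand-ins for UI.WIDTH, UI.HEIGHT

    static double Dot(double[] a, double[] b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    // Maps a world point a to screen pixels given camera position e, the camera's
    // unit right/forward/up vectors, the field of view in degrees, and the near
    // clip distance (distance from e to the viewing plane).
    public static double[] Project(double[] a, double[] e, double[] right,
                                   double[] forward, double[] up,
                                   double fovDeg, double nearClip)
    {
        // Express a - e in camera coordinates.
        double[] rel = { a[0] - e[0], a[1] - e[1], a[2] - e[2] };
        double cx = Dot(rel, right), cy = Dot(rel, forward), cz = Dot(rel, up);

        // Size of the view window on the viewing plane.
        double viewH = 2.0 * nearClip * Math.Tan(fovDeg * Math.PI / 360.0);
        double viewW = (ImageWidth / ImageHeight) * viewH;

        // Intersect the ray through a with the viewing plane and express the
        // intersection relative to the center of the view window.
        double planeX = cx * nearClip / cy;
        double planeZ = cz * nearClip / cy;

        // Shift the origin to the upper left corner and scale to pixels
        // (screen y grows downward).
        double screenX = (planeX / viewW + 0.5) * ImageWidth;
        double screenY = (0.5 - planeZ / viewH) * ImageHeight;
        return new[] { screenX, screenY };
    }

    // Enclose the projected vertices of the bounding cube and clamp to the image.
    public static double[] BoundingBox(double[][] projectedVertices)
    {
        double xmin = double.MaxValue, ymin = double.MaxValue;
        double xmax = double.MinValue, ymax = double.MinValue;
        foreach (double[] p in projectedVertices)
        {
            xmin = Math.Min(xmin, p[0]); xmax = Math.Max(xmax, p[0]);
            ymin = Math.Min(ymin, p[1]); ymax = Math.Max(ymax, p[1]);
        }
        xmin = Math.Max(0, xmin); ymin = Math.Max(0, ymin);
        xmax = Math.Min(ImageWidth, xmax); ymax = Math.Min(ImageHeight, ymax);
        return new[] { xmin, ymin, xmax, ymax };
    }
}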
3.5.3 General Approach To Annotation of Objects
The main objective is to draw bounding boxes around objects which are within a certain distance. There exist functions GetNearbyVehicles, GetNearbyPeds, and GetNearbyEntities. These functions allow us to get an array of all cars, pedestrians, and objects in an area around the test car. Each object can be tested individually to see if it is visible on the screen. We created a custom function for doing so, as the in-game function has unreliable behavior. This function works by checking whether it is possible to draw a straight line between e and at least one of the vertices of the bounding cube without hitting any other object. The name of this method is ray casting, and it will be discussed in more detail later. It must be noted that in the hierarchy of the game, pedestrians and vehicles are also entities. Therefore a filtering process is applied when bounding signs. This process is discussed in the signs section.
3.5.4 Cars
Compared to TORCS, GTA 5 has almost ten times more car models. There are 259 vehicles in GTA V (see http://www.ign.com/wikis/gta-5/Vehicles for the complete list). These vehicles come in various shapes and sizes, from golf carts to trucks and trailers. This diversity is more representative of the real distribution of vehicles and can hopefully be utilized to train more accurate neural networks. The above method can put a bounding box around any of these vehicles. Please see Figures 6, 7, and 8 for examples.
3.5.5 Pedestrians
Pedestrians can also be bounded for classification and localization training. GTA 5 has pedestrians of various genders and ethnicities. More importantly, the pedestrians in GTA 5 perform various actions such as standing, crossing streets, and sitting. This creates a lot of diversity for training. The drawback of GTA 5 is that all pedestrians are about the same height.
3.5.6 Signs
As mentioned before, signs are a bit trickier to bound. There are two reasons for this. First, the only way to find the signs which are around the test vehicle is to get all entities. This includes cars, pedestrians, and various miscellaneous props, many of which
Figure 6: Two cars bounded in boxes. Weather: rain.
Figure 7: Two cars bounded in boxes.
Figure 8: Traffic jam bounded in boxes.
Figure 9: Pedestrians bounded in boxes.
Sign Description                  DOT Id [3]
Stop Sign                         R1-1
Yield Sign                        R1-2
One Way Sign                      R6-1
No U-Turn Sign                    R3-4
Freeway Entrance                  D13-3
Do Not Enter / Wrong Way Sign     R5-1 and R5-1a
Figure 10: Some of the traffic signs present in GTA 5.
are of no interest. Thus we need to check the model of each entity to see if it is a traffic sign. To do so, we need a list of all of the models of all traffic signs in GTA 5. This list would include many of the signs listed in the Manual on Uniform Traffic Control Devices [3]. See Figure 10 for some of the signs in GTA 5.
The second difficulty with traffic signs is that they may require more than one bounding box. For example, a traffic light may have several lights on it; see Figure 12. This leads to the idea of spaces of interest (SOI). One sign model may have several spaces of interest we wish to bound.
Figure 11: Stop sign in bounding box.
Figure 12: Traffic lights in bounding boxes.
There is an elegant solution to both problems: a database of spaces of interest. Every entry contains a model hash code, the name of the sign, and the x, y, z coordinates of the front upper left and back lower right vertices of the bounding cube. With such a database, the algorithm for bounding signs is as follows:
Algorithm 3 Algorithm for bounding signs.
Require: d - database of spaces of interest
1: read in d
2: get array of entities e from GetNearbyEntities
3: for each entity in e do
4: check if the model of the entity matches any hash codes in d
5: get all the matching spaces of interest
6: for each space of interest do
7: draw a bounding box
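A minimal sketch of how such a database entry and the hash-code matching in step 4 could be represented is shown below; the class and field names are illustrative, with the fields taken from the entry description above, and the Vector3 type is the scripting library's.

using System.Collections.Generic;
using GTA.Math;

// One space of interest: a named sub-box of a sign model, stored as its front
// upper left and back lower right corners relative to the model.
class SpaceOfInterest
{
    public int ModelHash;
    public string SignName;
    public Vector3 FrontUpperLeft;
    public Vector3 BackLowerRight;
}

class SoiDatabase
{
    // Spaces of interest grouped by model hash for fast matching in step 4.
    private readonly Dictionary<int, List<SpaceOfInterest>> byHash =
        new Dictionary<int, List<SpaceOfInterest>>();

    public void Add(SpaceOfInterest soi)
    {
        List<SpaceOfInterest> list;
        if (!byHash.TryGetValue(soi.ModelHash, out list))
        {
            list = new List<SpaceOfInterest>();
            byHash[soi.ModelHash] = list;
        }
        list.Add(soi);
    }

    // Returns all spaces of interest for an entity's model hash, or an empty list.
    public List<SpaceOfInterest> Match(int modelHash)
    {
        List<SpaceOfInterest> list;
        return byHash.TryGetValue(modelHash, out list) ? list : new List<SpaceOfInterest>();
    }
}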
3.6 Pixel Maps
Pixel maps are more refined bounding boxes. Instead of marking an entity with the four corners of a box, we mark it with every pixel it occupies on the screen. This can be done easily when we start with a bounding box b = {(xmin, ymin), (xmax, ymax)} and invert the function which maps 3 dimensional points to the screen. The inverse of g can be constructed as follows. Given screenX and screenY in pixels, we transform the pixel values to coordinates on the viewing plane. Next, we transform the point on the viewing plane into a point in the 3 dimensional world, p_world.
viewPlaneX = (screenX / UI.WIDTH) · viewWindowWidth
viewPlaneZ = (−screenY / UI.HEIGHT) · viewWindowHeight
p_world = viewPlaneX · vx,camera + viewPlaneZ · vup,camera + newOrigin
Once we compute p_world, we use the Raycast function to get the entity which occupies that pixel. The Raycast function requires a point of origin, in our case e, a direction, in our case p_world − e, and a maximum distance the ray should travel, which we could set to a very large number such as 10,000. If the entity returned by Raycast matches the entity the bounding box encloses, then we add the pixel to the map.
Algorithm 4 Algorithm for computing a pixel map of an entity.
Require: entity, b = {(xmin, ymin), (xmax, ymax)}
1: let map be a boolean array of size IMAGE_WIDTH by IMAGE_HEIGHT
2: for x ∈ {xi | xi ∈ Z, xmin ≤ xi ≤ xmax} do
3: for y ∈ {yi | yi ∈ Z, ymin ≤ yi ≤ ymax} do
4: compute p_world for (x, y)
5: Raycast from e in the direction p_world − e to get entityRaycast
6: if entity = entityRaycast then
7: set map[x, y] to true
Depending on the application, these maps can be combined using the boolean OR function. The pixel map function is yet to be implemented due to time constraints. Besides being a straightforward extension of bounding boxes, it is also less useful for machine learning due to a cumbersome and perhaps unnecessarily complex representation of objects. Figures 13 and 14 show what the result of such a function would look like.
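Since the function has not been implemented, the following is only a sketch of how the inverse mapping above could look, using the same illustrative double[] convention as the projection sketch in Section 3.5.2; the ray casting step is left as a comment because the scripting library's exact Raycast signature is not reproduced here.

using System;

static class PixelMapSketch
{
    const double ImageWidth = 800, ImageHeight = 600; // stand-ins for UI.WIDTH, UI.HEIGHT

    // Inverse of the projection in Section 3.5.2: maps a screen pixel to the
    // corresponding 3D point on the viewing plane. newOrigin is the upper left
    // corner of the view window, xCam and upCam are the camera's unit x and up
    // vectors in world coordinates, and viewW/viewH are the view window size.
    public static double[] ScreenToWorld(double screenX, double screenY,
                                         double[] newOrigin, double[] xCam, double[] upCam,
                                         double viewW, double viewH)
    {
        double planeX = screenX / ImageWidth * viewW;
        double planeZ = -screenY / ImageHeight * viewH;
        return new[]
        {
            planeX * xCam[0] + planeZ * upCam[0] + newOrigin[0],
            planeX * xCam[1] + planeZ * upCam[1] + newOrigin[1],
            planeX * xCam[2] + planeZ * upCam[2] + newOrigin[2]
        };
    }
}

// For each pixel inside the bounding box, Algorithm 4 would then cast a ray from
// the camera position e through ScreenToWorld(x, y, ...) and mark the pixel only
// if the ray hits the entity being mapped.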
3.7 Road Lanes
Identifying and locating cars, pedestrians, and signs will only help with a part of the
driving task. Even without any of these things present, drivers must still stay within a
specified lane. Ultimately, locating the lanes and the vehicle’s position in them is the
foundation of the driving task. We will explore a method for extracting information
similar to [8] from GTA 5.
3.7.1 Notes on Drivers
First, let us examine how real drivers collect information on lane positions. There is ample literature on the topic. The general consensus is that humans look about 1 second ahead to locate lanes [10] [20] [19]. This time applies to speeds between 30 km/h and 60 km/h [10] [19] and corresponds to a distance of about 10 meters. In a more detailed model, human drivers have two distances at which they collect information. At 0.93 s, or 15.7 m, road curvature information is collected [19], and at 0.53 s, or 9 m, position in lane is collected [19]. Near information is used to fine-tune driving and is sufficient at low speeds [19]. At high speeds, the farther distance is used for guidance and stabilization [10]. Drivers also look about 5.5 degrees below the true horizon for road data [19]. For curves, humans use a tangent point on the inside of the curve for guidance [20]. They locate this point 1 to 2 seconds before entering the curve. [20]
Figure 13: Image with a bounding box.
Figure 14: Image with a pixel map for a car applied.
3.7.2 Indicators
From the literature on human cognition, we know where people look for information on road lanes. In [8], we find a very useful model of what information to collect. Chen et al.'s system uses 13 indicators for navigating down a highway-like racetrack. While this roadway is very simple compared to real-world roads, which have exits, entrances, shared left-turn lanes, and lane merges, the indicators are quite universal. Figure 15 lists the indicators, their descriptions, and their ranges.
3.7.3 Road Network in GTA 5
The GTA 5 road network is composed of 74,530 nodes and 77,934 links. [2] Each node has x, y, z coordinates and 19 flags, and each link consists of 2 node ids and 4 flags. [2] This information is contained in paths.ipl. Figures 16 and 17 show which flags are currently known. It does not appear that any of these flags would be particularly useful for figuring out the location of the lane markings.
The Federal Highway Administration sets the lane width for freeway lanes at 3.6 m (12 feet) and for local roads at between 2.7 m and 3.6 m; ramps are between 3.6 and 9 m (12 to 30 feet). [1] Based on our measurements, the lanes in GTA 5 are 5.6 meters wide. This should not be a problem when a trained network is applied to real world applications, since the output can always be scaled.
3.7.4 Finding the Lanes
We know what information we would like to collect, and we know that we want to collect it at a point on the road about 10 meters in front of the test car. Figure 18 represents our data collection situation. We want to compute where the lanes are at the blue line. Assuming we could locate the left, middle, and right lane markings, we could then see if there are any cars whose positions fall between these points. The cars would also have to be visible on the screen and no farther than some maximum distance. Following [8], this distance d could be 70 meters.
We can compute the indicators if we know the position of the lanes and the heading of the road. Let h be the heading vector of the road at the 10 meter mark. Let LL, ML, MR, and RR be points on the lane markings where the blue line intersects the lanes. Let f be a point on the ground at the very front of the test vehicle, possibly below the camera. We will perform the calculation for the three-lane indicators, as the two-lane indicators can then be filled in with values based on them. The angle is simply the angle between the test car heading vector, hcar, and the road heading vector.
Indicators

Indicator      Description                                               Min Value   Max Value
angle          angle between the car's heading and the tangent           -0.5        0.5
               of the road
dist L         distance to the preceding car in the left lane            0           75
dist R         distance to the preceding car in the right lane           0           75
toMarking L    distance to the left lane marking                         -7          -2.5
toMarking M    distance to the central lane marking                      -2          3.5
toMarking R    distance to the right lane marking                        2.5         7
dist LL        distance to the preceding car in the left lane            0           75
dist MM        distance to the preceding car in the current lane         0           75
dist RR        distance to the preceding car in the right lane           0           75
toMarking LL   distance to the left lane marking of the left lane        -9.5        -4
toMarking ML   distance to the left lane marking of the current lane     -5.5        -0.5
toMarking MR   distance to the right lane marking of the current lane    0.5         5.5
toMarking RR   distance to the right lane marking of the right lane      4           9.5

Figure 15: List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
Flag Meaning
0 0 (primary) or 1 (secondary or tertiary)
1 0 (land), 1 (water)
2 unknown (0 for all nodes)
3 unknown (1 for 65,802 nodes, otherwise 0, 2, or 3)
4 0 (road), 2 (unknown), 10 (pedestrian), 14 (interior), 15 (stop), 16 (stop), 17 (stop),
18 (pedestrian), 19 (restricted)
5 unknown (from 0/15 to 15/15)
6 unknown (0 for 60,111 nodes, 1,141 other values)
7 0 (road) or 1 (highway or interior)
8 0 (primary or secondary) or 1 (tertiary)
9 0 (most nodes) or 1 (some tunnels)
10 unknown (0 for all nodes)
11 0 (default) or 1 (stop - turn right)
12 0 (default) or 1 (stop - go straight)
13 0 (major) or 1 (minor)
14 0 (default) or 1 (stop - turn left)
15 unknown (1 for 10,455 nodes, otherwise 0)
16 unknown (1 for 32 nodes, otherwise 0, on highways)
17 unknown (1 for 62 nodes, otherwise 0, on highways)
18 unknown (1 for 92 nodes, otherwise 0, some turn lanes)
Figure 16: Flags for links. [2]
Flag Meaning
0 unknown (-10, -1 to 8 or 10)
1 unknown (0 to 4 or 6)
2 0 (one-way), 1 (unknown), 2 (unknown), 3 (unknown)
3 0 (unknown), 1 (unknown), 2 (unknown), 3 (unknown), 4 (unknown), 5 (unknown), 8
(lane change), 9 (lane change), 10 (street change), 17 (street change), 18 (unknown),
19 (street change)
Figure 17: Flags for nodes. [2]
angle = cos⁻¹( (h · hcar) / (||h|| ||hcar||) )
Figure 18: Blue line represents where we want to collect data on lane location.

For toMarking LL, toMarking ML, toMarking MR, and toMarking RR, we will assume that the lanes are straight lines. We have a point on each of those lines and a vector indicating the direction in which they are heading. This assumption is crude; however, at the distances we are discussing it should not produce large errors. Additionally, we could adjust the distance at which we sample data based on the road heading. This would not only be more in line with human behavior [10] [20] [19], it would also reduce errors. To compute the distance we must project the vector f − LL onto the vector −h and compute the distance between the projected point and f − LL. We will work out the mathematics for the left marking of the left lane, LL.
r = proj_(−h)(f − LL) = ( ((f − LL) · (−h)) / ||−h||² ) (−h)
toMarking LL = ||(f − LL) − r||
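A small sketch of the angle and toMarking LL computations above, using illustrative double[] vectors in place of the game's Vector3 type:

using System;

static class LaneIndicatorSketch
{
    static double Dot(double[] a, double[] b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    static double Norm(double[] a) { return Math.Sqrt(Dot(a, a)); }

    static double[] Sub(double[] a, double[] b)
    {
        return new[] { a[0] - b[0], a[1] - b[1], a[2] - b[2] };
    }

    static double[] Scale(double s, double[] a)
    {
        return new[] { s * a[0], s * a[1], s * a[2] };
    }

    // Angle between the test car heading hCar and the road heading h.
    public static double Angle(double[] h, double[] hCar)
    {
        return Math.Acos(Dot(h, hCar) / (Norm(h) * Norm(hCar)));
    }

    // Distance from the point f (front of the test car) to the lane marking
    // that passes through the point LL with direction h.
    public static double ToMarking(double[] f, double[] LL, double[] h)
    {
        double[] v = Sub(f, LL);                                            // f - LL
        double[] minusH = Scale(-1.0, h);                                   // -h
        double[] r = Scale(Dot(v, minusH) / Dot(minusH, minusH), minusH);   // proj_(-h)(f - LL)
        return Norm(Sub(v, r));                                             // ||(f - LL) - r||
    }
}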
To compute dist LL, dist MM, and dist RR, we must first figure out which vehicles are in which lanes. For all the vehicles returned by GetNearbyVehicles, we can eliminate any whose heading vector forms an angle of more than 90 degrees with the heading of the road. The position of the vehicle, p, must be within the rectangular prism formed by LL, RR, f, and f + d · h in the direction normal to the ground, which is also the up vector of the test car, vup. This can be checked by projecting LL − f, RR − f, and d · h onto the plane defined by f and vup. The following are the projections of the points.
rLL = LL − proj_vup(LL) = LL − ( (LL · vup) / ||vup||² ) vup
rRR = RR − proj_vup(RR) = RR − ( (RR · vup) / ||vup||² ) vup
r(f+d·h) = (f + d·h) − proj_vup(f + d·h) = (f + d·h) − ( ((f + d·h) · vup) / ||vup||² ) vup
rp = p − proj_vup(p) = p − ( (p · vup) / ||vup||² ) vup
Now we just have to check that the y coordinate of rp is between the y coordinates of rLL and rRR, and that the x coordinate of rp is between 0 and the x coordinate of r(f+d·h). If the vehicle satisfies these bounds, we can compute its distance to all lane markings in the same way we did for the test vehicle. We then check which marking it is closest to and assign it to that lane, or perform additional logic. Let us assume it is in the left lane. We perform the following to compute dist LL.

r = proj_hcar(p − f) = ( ((p − f) · hcar) / ||hcar||² ) hcar
dist LL = ||r||
Algorithm 5 Algorithm for computing dist LL, dist MM, and dist RR.
Require: road heading h, car heading hcar, front point f, maximum distance d, and lane marking points LL, ML, MR, RR
1: create arrays dist LLs, dist MMs, and dist RRs and add d to each
2: l is the lane of the vehicle
3: for each vehicle v returned by GetNearbyVehicles do
4: if cos⁻¹( (h · hcar) / (||h|| ||hcar||) ) < π/2 then
5: if p is in the three lanes, in front of test car, and close then
6: compute toMarking LL, toMarking ML, toMarking MR, and toMarking RR for p
7: if toMarking LL is smallest then
8: l = left lane
9: if toMarking RR is smallest then
10: l = right lane
11: if toMarking ML is smallest AND toMarking LL < toMarking MR then
12: l = left lane
13: else
14: l = middle lane
15: if toMarking MR is smallest AND toMarking RR < toMarking MR then
16: l = right lane
17: else
18: l = middle lane
19: if l = right lane then
20: add ||projhcar (p − f)|| to dist RRs
21: else if l = left lane then
22: add ||projhcar (p − f)|| to dist LLs
23: else
24: add ||projhcar (p − f)|| to dist MMs
25: dist RR = min dist RRs
26: dist LL = min dist LLs
27: dist MM = min dist MMs
To perform the above computation we need a vector representing the heading of the road and a point on each lane marking. This is where the challenge begins. We cannot use any of the functions or methods discussed for objects, because roads and lane markings are not entities. The road is part of the terrain and the lanes are a texture. Therefore, we cannot get the width of the road model or the position of a lane marking the way we obtained those properties for cars.
GTA 5 has realistic traffic. There are many AI-driven cars in the game which navigate the road network while staying in lanes. Therefore, the game engine knows the location of the lane markings. There are several functions which pertain to roads. GetStreetName returns the name of the street at a specified point in the world. IS_POINT_ON_ROAD is a native function which checks if a point is on a road. There are also several functions which deal with vehicle nodes.
Vehicle nodes appear to be the primary way the graph of the road network is represented in the game. Every vehicle node is a point at the center of the road, as seen in Figure 19. The nodes are spaced out in proportion to the curvature of the road: close together at sharp corners and farther apart on straight stretches of road. Each node has a unique id.
The main functions for working with nodes are GET_NTH_CLOSEST_VEHICLE_NODE and GET_NTH_CLOSEST_VEHICLE_NODE_ID. A way to call them in a script is shown in the code snippet below. In this code snippet, the "safe" arguments serve an unknown purpose, as do the two zeros in GET_NTH_CLOSEST_VEHICLE_NODE_ID. The i variable specifies which node in the order of proximity should be selected. There is also a function GET_VEHICLE_NODE_PROPERTIES; however, we could not find a way to get this function to work.
OutputArgument safe1 = new OutputArgument();
OutputArgument safe2 = new OutputArgument();
OutputArgument safe3 = new OutputArgument();
Vector3 midNode;
OutputArgument outPosArg = new OutputArgument();
Function.Call(Hash.GET_NTH_CLOSEST_VEHICLE_NODE,
playerPos.X, playerPos.Y, playerPos.Z, i, outPosArg, safe1, safe2, safe3);
midNode = outPosArg.GetResult<Vector3>();
int nodeId = Function.Call<int>(Hash.GET_NTH_CLOSEST_VEHICLE_NODE_ID,
playerPos.X, playerPos.Y, playerPos.Z, i, safe1, 0f, 0f);
Figure 19: Red markers represent locations of vehicle nodes.
The benefit of this system is that we can locate our car on the network by getting the closest node. Given the road heading and lane width, it is possible to compute the centers of the lanes, as seen in Figure 20. The problem is that, as far as we could find, there is no way of getting the heading of the road or the number and positions of the lanes around the node.
Figure 20: Red markers represent locations of vehicle nodes. Blue markers are extrapo-
lations of lane middles based on road heading and lane width.
A promising approach to solving this problem was road model fitting. We know that the node is at the center of the road. We do not know if it is on a lane marking or in the middle of a lane. We could assume that it is on a lane marking and then count the number of lanes on the left and right. This could be done by moving a lane width over and checking if the point is still on the road using IS_POINT_ON_ROAD and GetStreetName. We can repeat the same method under the assumption that the node is in the middle of a lane. Whichever assumption finds more lanes is the correct one, as the wrong assumption will not count the outermost lanes. This still leaves the question of finding the heading of the road and whether the node is between lanes going in opposite directions. However, there are two fundamental problems with this approach which make it useless. First, this approach assumes that the nodes are at the centers of lanes or on lane markings. Upon further exploration, we found that nodes can be on medians, as in Figure 21. This is still the center of the road, just not where we expect it. Second, IS_POINT_ON_ROAD is not a reliable indicator of whether a point is actually on a road. Sometimes it returns false for points which are clearly on the road, and sometimes it returns true for points which are on the side of the road.
Figure 21: Red markers represent locations of vehicle nodes. Blue markers are extrapo-
lations of lane middles based on road heading and lane width. The blue marker in front
of the test car represents where we want to measure lanes.
There are two solutions to this problem. The first solution is to keep hacking at the
game until we find all of this information. The information we are looking for must be
somewhere in the game because the game AI knows where to drive. It knows where the
lanes are and how to stay in them. The second solution is to build a database of nodes.
Figure 22 lists the data which would be stored in this database.
Field Meaning
nodeId The numerical id of the node.
onMarking True if the node is on a lane marking, false if it is in the middle of a lane.
oneWay True if the traffic on both sides of the node moves in the same direction.
leftStart Vector representing the point where the road begins left of the node.
leftEnd Vector representing the point where the road ends left of the node.
rightStart Vector representing the point where the road begins right of the node.
rightEnd Vector representing the point where the road ends right of the node.
heading Vector representing the heading of the road.
Figure 22: Node database entry design.
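A minimal sketch of an entry class matching Figure 22; the field names mirror the table above, the Vector3 type is the scripting library's, and the class itself is only a proposed design, not existing code.

using GTA.Math;

// One entry of the proposed node database (Figure 22).
class RoadNodeEntry
{
    public int NodeId;          // numerical id of the vehicle node
    public bool OnMarking;      // true if the node sits on a lane marking
    public bool OneWay;         // true if traffic on both sides moves in the same direction
    public Vector3 LeftStart;   // where the road begins left of the node
    public Vector3 LeftEnd;     // where the road ends left of the node
    public Vector3 RightStart;  // where the road begins right of the node
    public Vector3 RightEnd;    // where the road ends right of the node
    public Vector3 Heading;     // heading of the road at the node
}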
The problem with this method is that there are over 70,000 nodes and there does not appear to be an easy way of collecting this information. At the present moment, however, we do not see a simpler solution.
4 Towards The Ultimate AI Machine
The previous section outlined methods for getting information out of GTA 5 to create datasets. To fully utilize GTA 5, we still need to create a database of nodes and spaces of interest. Once that is done, we will move on to creating datasets and training neural networks.
The objective of harvesting this data has been emphasized as training data for neural networks. However, the ultimate goal is much grander: building a system which can master driving in GTA 5. This system would probably include several neural networks and perhaps other statistical models. For example, it may include a network for locating pedestrians, an SVM for classifying street signs, another network for recognizing traffic lights, etc. All of these components would be linked together by some master program that would construct the most likely world model based on all of these "sensors". Then another program would be responsible for driving the car. Since we can extract data from GTA 5 in real time, we can test how well this system would work in changing conditions.
In the process of building such a system, it is possible to test out some new ideas in neural networks. We would like to continue to explore curriculum learning [5] and self-paced learning [18] [16] as means of presenting examples in order of difficulty. Since these ideas have been applied to object tracking in video [28], teaching robots motor skills [17], matrix factorization [31], handwriting recognition [22], and multi-task learning [26], surpassing state-of-the-art benchmarks, we hope that they could be used to improve autonomous driving. Another interesting idea is transfer learning [25], the ability to use a network trained in one domain in another domain. This could be applied to pedestrian and sign classifiers. Lastly, we have been working on ways to use optimal learning to select the best neural network architectures. It would be interesting to try those methods in this application.
Building this system presents two major difficulties. First, both the game and the neural networks are GPU-intensive processes. Running both on a single machine would require a lot of computational power. Second, GTA 5 only works on Windows PCs, while most deep learning libraries are Linux based. Porting either application is close to infeasible. Last semester, working with Daniel Stanley and Bill Zhang, we constructed a solution for running GTA 5 with TorcsNet from [8]. The idea was to run the processes on separate machines and have them communicate via a shared folder on a local network; see Figure 23. During the tests, the amount of data transferred was small: a text file of 13 floats and a 280 by 210 png image. This setup is fast enough for the system to run at around 10 Hz.
Figure 23: GTA V Experimental Setup
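A minimal sketch of the writing side of this exchange is shown below; the shared folder path, file name, and comma-separated format are illustrative assumptions rather than the exact setup used in the tests.

using System.Globalization;
using System.IO;
using System.Linq;

static class SharedFolderLink
{
    // Hypothetical shared folder visible to both machines.
    const string SharedFolder = @"\\gta-machine\shared";

    // Write 13 floats as one comma-separated line; the machine on the other end
    // polls the folder for new files (indicators and screenshots) on its own schedule.
    public static void WriteIndicators(float[] indicators)
    {
        string line = string.Join(",",
            indicators.Select(v => v.ToString(CultureInfo.InvariantCulture)));
        File.WriteAllText(Path.Combine(SharedFolder, "indicators.txt"), line);
    }
}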
4.1 Future Research Goals
Build a database of GTA V road nodes
Build a database of GTA V road signs
Train sign classifier
Train traffic lights classifier
Compare how well GTA V trained classifier works on real datasets
Check how well the TORCS network can identify cars in GTA V
Build a robust controller in GTA V which uses all 13 indicators
Explore the effects of curriculum learning on driving performance
Explore transfer learning and optimal learning for neural networks
Test trained models in a real vehicle (PAVE)
A Screenshot Function
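// This excerpt assumes the following namespaces are imported at the top of the
// script file: System, System.Drawing, System.Drawing.Imaging, and
// System.Runtime.InteropServices.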
private struct Rect
{
public int Left;
public int Top;
public int Right;
public int Bottom;
}
[DllImport("user32.dll")]
private static extern IntPtr GetForegroundWindow();
[DllImport("user32.dll")]
private static extern IntPtr GetClientRect(IntPtr hWnd, ref Rect rect);
[DllImport("user32.dll")]
private static extern IntPtr ClientToScreen(IntPtr hWnd, ref Point point);
void screenshot(String filename)
{
//UI.Notify("Taking screenshot?");
var foregroundWindowsHandle = GetForegroundWindow();
var rect = new Rect();
GetClientRect(foregroundWindowsHandle, ref rect);
var pTL = new Point();
var pBR = new Point();
pTL.X = rect.Left;
pTL.Y = rect.Top;
pBR.X = rect.Right;
pBR.Y = rect.Bottom;
ClientToScreen(foregroundWindowsHandle, ref pTL);
ClientToScreen(foregroundWindowsHandle, ref pBR);
Rectangle bounds = new Rectangle(pTL.X, pTL.Y, rect.Right - rect.Left,
rect.Bottom - rect.Top);
using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
{
using (Graphics g = Graphics.FromImage(bitmap))
{
g.ScaleTransform(.2f, .2f);
g.CopyFromScreen(new Point(bounds.Left, bounds.Top), Point.Empty, bounds.Size);
}
Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
using (Graphics g = Graphics.FromImage(output))
{
g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
}
output.Save(filename, ImageFormat.Bmp);
}
}
References
[1] Lane width. http://safety.fhwa.dot.gov/geometric/pubs/
mitigationstrategies/chapter3/3_lanewidth.cfm. Accessed: 2016-4-29.
[2] Paths (gta v). http://gta.wikia.com/wiki/Paths_(GTA_V). Accessed: 2016-4-29.
[3] Federal Highway Administration. Manual on uniform traffic control devices. 2009.
[4] A. R. Atreya, B. C. Cattle, B. M. Collins, B. Essenburg, G. H. Franken, A. M. Saxe,
S. N. Schiffres, and A. L. Kornhauser. Prospect eleven: Princeton university’s entry
in the 2005 darpa grand challenge. Journal of Field Robotics, 23(9):745–753, 2006.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In
Proceedings of the 26th annual international conference on machine learning, pages
41–48. ACM, 2009.
[6] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner.
Torcs, the open racing car simulator. http://www.torcs.org, 2014.
[7] S. M. Bileschi. StreetScenes: Towards scene understanding in still images. PhD
thesis, Citeseer, 2006.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for
direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
[9] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark.
In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, pages 304–311. IEEE, 2009.
[10] E. Donges. A two-level model of driver steering behavior. Human Factors: The
Journal of the Human Factors and Ergonomics Society, 20(6):691–707, 1978.
[11] F. Flohr, D. M. Gavrila, et al. Pedcut: an iterative framework for pedestrian seg-
mentation combining shape models and multiple data cues. 2013.
[12] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation
benchmark for road detection algorithms. In International Conference on Intelligent
Transportation Systems (ITSC), 2013.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti
vision benchmark suite. In Conference on Computer Vision and Pattern Recognition
(CVPR), 2012.
[14] E. Guizzo. How Google's self-driving car works. IEEE Spectrum Online, October 18,
2011.
[15] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic
signs in real-world images: The German Traffic Sign Detection Benchmark. In
International Joint Conference on Neural Networks, number 1288, 2013.
[16] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning
with diversity. In Advances in Neural Information Processing Systems, pages 2078–
2086, 2014.
[17] A. Karpathy and M. Van De Panne. Curriculum learning for motor skills. In
Advances in Artificial Intelligence, pages 325–330. Springer, 2012.
[18] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable
models. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1189–1197.
Curran Associates, Inc., 2010.
[19] M. Land, J. Horwood, et al. Which parts of the road guide steering? Nature,
377(6547):339–340, 1995.
[20] M. F. Land and D. N. Lee. Where we look when we steer. Nature, 1994.
[21] F. Larsson and M. Felsberg. Using fourier descriptors and spatial models for traffic
sign recognition. In Image Analysis, pages 238–249. Springer, 2011.
[22] J. Louradour and C. Kermorvant. Curriculum learning for handwritten text line
recognition. In Document Analysis Systems (DAS), 2014 11th IAPR International
Workshop on, pages 56–60. IEEE, 2014.
[23] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition – how
far are we from the solution? In Neural Networks (IJCNN), The 2013 International
Joint Conference on, pages 1–8. IEEE, 2013.
[24] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detec-
tion and analysis for intelligent driver assistance systems: Perspectives and survey.
Intelligent Transportation Systems, IEEE Transactions on, 13(4):1484–1497, 2012.
[25] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engi-
neering, IEEE Transactions on, 22(10):1345–1359, 2010.
[26] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple
tasks. arXiv preprint arXiv:1412.1353, 2014.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by
back-propagating errors. Cognitive modeling, 5(3):1, 1988.
[28] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Com-
puter Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages
2379–2386. IEEE, 2013.
[29] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong,
J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the darpa
grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
[30] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier. Evaluation of road mark-
ing feature extraction. In Proceedings of 11th IEEE Conference on Intelli-
gent Transportation Systems (ITSC’08), pages 174–181, Beijing, China, 2008.
http://perso.lcpc.fr/tarel.jean-philippe/publis/itsc08.html.
[31] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced
learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial
Intelligence, 2015.
46

More Related Content

Viewers also liked (15)

Exponentes
ExponentesExponentes
Exponentes
 
Bonetech Profile
Bonetech ProfileBonetech Profile
Bonetech Profile
 
CHAMELEON BUNDLE (1)
CHAMELEON BUNDLE (1)CHAMELEON BUNDLE (1)
CHAMELEON BUNDLE (1)
 
Love the Sinner
Love the SinnerLove the Sinner
Love the Sinner
 
Hmwssb bill payment
Hmwssb bill paymentHmwssb bill payment
Hmwssb bill payment
 
Reglamento campeonatossplf1 2016
Reglamento campeonatossplf1 2016Reglamento campeonatossplf1 2016
Reglamento campeonatossplf1 2016
 
Esmeraldas
EsmeraldasEsmeraldas
Esmeraldas
 
Còmo acreditar los derechos de autor
Còmo acreditar los derechos de autorCòmo acreditar los derechos de autor
Còmo acreditar los derechos de autor
 
URBAN ERASMUS TRAIL BLOCK 1
URBAN ERASMUS TRAIL BLOCK 1URBAN ERASMUS TRAIL BLOCK 1
URBAN ERASMUS TRAIL BLOCK 1
 
historia
historia historia
historia
 
Urgencias Médicas
Urgencias Médicas Urgencias Médicas
Urgencias Médicas
 
Obsługa klienta w social media - Bądź jeden level ponad normą!
Obsługa klienta w social media - Bądź jeden level ponad normą!Obsługa klienta w social media - Bądź jeden level ponad normą!
Obsługa klienta w social media - Bądź jeden level ponad normą!
 
Unidad didactica Matemáticas 2do año Secundaria
Unidad didactica Matemáticas 2do año SecundariaUnidad didactica Matemáticas 2do año Secundaria
Unidad didactica Matemáticas 2do año Secundaria
 
Demo i punkt
Demo i punktDemo i punkt
Demo i punkt
 
Sistemas de rep pp
Sistemas de rep ppSistemas de rep pp
Sistemas de rep pp
 

Similar to Video Games for Autonomous Driving

Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Mayur Wadekar
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)Øystein Øihusom
 
dissertation_hrncir_2016_final
dissertation_hrncir_2016_finaldissertation_hrncir_2016_final
dissertation_hrncir_2016_finalJan Hrnčíř
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Ed Kelley
 
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...TanuAgrawal27
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Artur Filipowicz
 
2000402 en juniper good
2000402 en juniper good2000402 en juniper good
2000402 en juniper goodAchint Saraf
 
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfIMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfvenkatesh231416
 
Autonomous cargo transporter report
Autonomous cargo transporter reportAutonomous cargo transporter report
Autonomous cargo transporter reportMuireannSpain
 
gps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminargps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminarhiharshal277
 
UIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedUIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedMichael Bailey
 

Similar to Video Games for Autonomous Driving (20)

Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.Vehicle to Vehicle Communication using Bluetooth and GPS.
Vehicle to Vehicle Communication using Bluetooth and GPS.
 
Thesis Report
Thesis ReportThesis Report
Thesis Report
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
 
dissertation_hrncir_2016_final
dissertation_hrncir_2016_finaldissertation_hrncir_2016_final
dissertation_hrncir_2016_final
 
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
Particle Filter Localization for Unmanned Aerial Vehicles Using Augmented Rea...
 
Vivarana fyp report
Vivarana fyp reportVivarana fyp report
Vivarana fyp report
 
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
Smart Traffic Management System using Internet of Things (IoT)-btech-cse-04-0...
 
LC_Thesis_Final (1).pdf
LC_Thesis_Final (1).pdfLC_Thesis_Final (1).pdf
LC_Thesis_Final (1).pdf
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
2000402 en juniper good
2000402 en juniper good2000402 en juniper good
2000402 en juniper good
 
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdfIMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
 
Autonomous cargo transporter report
Autonomous cargo transporter reportAutonomous cargo transporter report
Autonomous cargo transporter report
 
T401
T401T401
T401
 
Report_Jeremy_Berard
Report_Jeremy_BerardReport_Jeremy_Berard
Report_Jeremy_Berard
 
Honours_Thesis2015_final
Honours_Thesis2015_finalHonours_Thesis2015_final
Honours_Thesis2015_final
 
gps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminargps gsm based vehicle tracking system seminar
gps gsm based vehicle tracking system seminar
 
final_report
final_reportfinal_report
final_report
 
vanet_report
vanet_reportvanet_report
vanet_report
 
UIC Systems Engineering Report-signed
UIC Systems Engineering Report-signedUIC Systems Engineering Report-signed
UIC Systems Engineering Report-signed
 

More from Artur Filipowicz

Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Artur Filipowicz
 
Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Artur Filipowicz
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...Artur Filipowicz
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Artur Filipowicz
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Artur Filipowicz
 
Direct Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowDirect Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowArtur Filipowicz
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Artur Filipowicz
 
Filtering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionFiltering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionArtur Filipowicz
 
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningDesensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningArtur Filipowicz
 

More from Artur Filipowicz (9)

Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)Smart Safety for Commercial Vehicles (ENG)
Smart Safety for Commercial Vehicles (ENG)
 
Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)Smart Safety for Commercial Vehicles (中文)
Smart Safety for Commercial Vehicles (中文)
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
Direct Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlowDirect Perception for Congestion Scene Detection Using TensorFlow
Direct Perception for Congestion Scene Detection Using TensorFlow
 
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
Learning to Recognize Distance to Stop Signs Using the Virtual World of Grand...
 
Filtering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial RecognitionFiltering of Frequency Components for Privacy Preserving Facial Recognition
Filtering of Frequency Components for Privacy Preserving Facial Recognition
 
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine LearningDesensitized RDCA Subspaces for Compressive Privacy in Machine Learning
Desensitized RDCA Subspaces for Compressive Privacy in Machine Learning
 

Video Games for Autonomous Driving

  • 1. Driving School II Video Games for Autonomous Driving Independent Work Artur Filipowicz ORFE Class of 2017 Advisor Professor Alain Kornhauser arturf@princeton.edu May 3, 2016 Revised August 27, 2016 1
  • 2. Abstract We present a method for generating datasets to train neural networks and other statistical models to drive vehicles. In [8], Chen et al. used a racing simulator called Torcs to generate a dataset of driving scenes which they then used to train a neural network. One limitation of Torcs is a lack of realism. The graphics are plain and the only roadways are racetracks, which means there are no intersections, pedestrian crossings, etc. In this paper we employ a game call Grand Theft Auto 5 (GTA 5). This game features realistic graphics and a complex transportation system of roads, highways, ramps, intersections, traffic, pedestrians, railroad crossings, and tunnels. Unlike Torcs, GTA 5 has more car models, urban, suburban, and rural environments, and control over weather and time. With the control of time and weather, GTA 5 has an edge over conventional methods of collecting datasets as well. We present methods for extracting three particular features. We create a function for generating bounding boxes around cars, pedestrians and traffic signs. We also present a method for generating pixel maps for objects in GTA 5. Lastly, we develop a way to compute distances to lane markings and other indicators from [8] 2
  • 3. Acknowledgments I would like to thank Professor Alain L. Kornhauser for his mentorship during this project and Daniel Stanley and Bill Zhang for their help over the summer and last semester. This paper represents my own work in accordance with University regulations. Artur Filipowicz 3
  • 4. Contents 1 From The Driving Task to Machine Learning 6 1.1 The Driving Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 The World Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Datasets for the Driving Task 10 2.1 Cars, Pedestrians, and Cyclysis . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Observations on Current Datasets . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Video Games and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Sampling from GTA 5 12 3.1 GTA 5 Scripts Development . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Test Car . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 Desired Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4 Screenshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5 Bounding Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5.1 GTA 5 Camera Model . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5.2 From 3D to 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5.3 General Approach To Annotation of Objects . . . . . . . . . . . . 20 3.5.4 Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.5 Pedestrians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.6 Signs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.6 Pixel Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.7 Road Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7.1 Notes on Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7.2 Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.7.3 Road Network in GTA 5 . . . . . . . . . . . . . . . . . . . . . . . 29 3.7.4 Finding the Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Towards The Ultimate AI Machine 40 4.1 Future Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 A Screenshot Function 42 List of Figures 1 Graphics and roads in Torcs. . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Graphics and roads in GTA 5. . . . . . . . . . . . . . . . . . . . . . . . . 12 4
  • 5. 3 Test Vehicle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 The red dot represents camera location. . . . . . . . . . . . . . . . . . . . 15 5 Camera model and parameters in GTA 5 . . . . . . . . . . . . . . . . . . 18 6 Two cars bounded in boxes. Weather: rain. . . . . . . . . . . . . . . . . 21 7 Two cars bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . . 21 8 Traffic jam bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . 22 9 Pedestrians bounded in boxes. . . . . . . . . . . . . . . . . . . . . . . . . 22 10 Some of the traffic signs present in GTA 5. . . . . . . . . . . . . . . . . . 23 11 Stop sign in bounding box. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 12 Traffic lights in bounding boxes. . . . . . . . . . . . . . . . . . . . . . . . 24 13 Image with a bounding box. . . . . . . . . . . . . . . . . . . . . . . . . . 27 14 Image with a pixel map for a car applied. . . . . . . . . . . . . . . . . . . 28 15 List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8]. . . . . . . . . . . . . 30 16 Flags for links. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 17 Flags for nodes. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 18 Blue line represents where we want to collect data on lane location. . . . 32 19 Red markers represent locations of vehicle nodes. . . . . . . . . . . . . . 36 20 Red markers represent locations of vehicle nodes. Blue markers are ex- trapolations of lane middles based on road heading and lane width. . . . 37 21 Red markers represent locations of vehicle nodes. Blue markers are ex- trapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 22 Node database entry design. . . . . . . . . . . . . . . . . . . . . . . . . . 39 23 GTA V Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 41 5
  • 6. 1 From The Driving Task to Machine Learning 1.1 The Driving Task The driving task is a physics problem of moving an object from point a ∈ R4 to b ∈ R4 , with time being the fourth dimension, without colliding with any other object. There are also additional constraints in the form of lane markings, speed limits, and traffic flow directions. Even with all constraints beyond avoiding collisions, the physical problem of finding a navigable path is easy given a model of the world. That is, if the location of all objects and their shapes is known with certainty and the location of the constraints is known, then the task becomes first the computation of a path in a digraph G representing the road network and then for each edge finding unoccupied space and moving the object into it. All of these problems can be solved using fundamental physics and computer science. What makes the driving task difficult in the real world setting is the lack of an accurate world model. In reality we do not have omniscient drivers. 1.2 The World Model People drive, and so do computers to a limited extent. Therefore, omniscience is not necessary. Some subset of the total world model is good enough to perform the driving task. Perhaps with limited knowledge, it is only possible to successful complete the task with a probability less than 1, but the success rate is high enough for people to utilize this form of transport. To drive, we still need a world model. This model is constructed by the means of sensor fusion, the combination of information from several different sensors. In 2005, Princeton University’s entry in the DARPA Challenge, Prospect 11, used radar and cameras to identify and locate obstacles. Based on these measurements and GPS data, the on-board computer would create a world model and find a safe path. [4] In a similar approach, the Google Car uses radar and lidar to map the world around it. [14] Approaches in [4], [14], and [29] appear rather cumbersome and convoluted compared to the human way of creating a world model. Humans have 5 sensors, the eyes, the nose, the ears, the mouth, and the skin. In driving neither taste nor smell nor touch are used to build the world model as all of these senses are mostly cut off from the world outside the vehicle. The driver can hear noises from the outside. However, they can be muffled by the sound of the driver’s own vehicle and many important objects, such as street signs and lane markings, do not make noise. To construct the world model humans predominantly use one sensor, the eyes. We can suspect that there is enough information encoded in visible light coming through the front windshield to build a world model good enough for completing the driving task. However, research on autonomous vehicles - the construction of solution to the driving task using artificial intelligence - stays away from 6
  • 7. approaching the problem in the pure vision way, as noted in [4] and [29]. The reason for this is that vision, computer vision in particular, is difficult. 1.3 Computer Vision Let X ∈ Rh∗w∗c be an image of width w and height h and c colors. As we stated earlier, X has enough information for a human to figure out where lane markings and other vehicles are, identify and classify road signs and perform other measurements to build a world model. Perhaps, maybe several images in a sequence are necessary, although [8] shows that one image can be used to extract a lot of information. The difficult of computer vision is that X is a matrix of numbers representing colors of pixels. In this representation an object can appear very different depending on lighting conditions. Additionally, due to perspective, objects appear in different sizes and therefore occupy different number of pixels, even if the object is the same. These are two of many variations which humans can account for, but naive machine approaches fail. Computer vision is difficult but not impossible. In recent decades, researches used ma- chine learning to enable computers to take X and construct more salient representations. 1.4 Machine Learning The learning task is as follows; given some image Xi, we wish to predict a vector of indicators Yi. Yi could be distances to lane markings, vehicles, locations of street sings etc. and can then be used to construct a world model. To that end, we want to train a function f such that Yi = f(Xi). We say that Xi, Yi ∼ PX,Y . The machine learning approach to this problem mimics humans in more then just the focus on visual information. The primary method of learning images is the use of neural networks, more specifically convolutional neural networks. These statistical models are inspired by neurons which make up the nerves and brain areas responsible for vision. The mathematical abstraction is represented as follows: Let f(xi, W) be a neural network of L hidden layers. The sizes of these layers are l1 to lL. f(xi, W) = gL(W(L)...g3(W(3)g2(W(2)g1(W(1)xi)))...) W = {W(1), W(2), ...W(L)} W(i) ∈ Rli+1× li gi(x) is some activation function. 7
  • 8. The process of training becomes the process of adjusting values of W(i). This first requires some loss function which expresses the error made by the network. A common loss function is L2. We wish to create a neural network model f such that min f L2(T , f) where let D be a dataset of n indicator Yi and image Xi pairs D = {(Xi, Yi)}n i=1 and let R be the training set and let T be the test set. R ⊂ D T ⊂ D R ∩ T = Ø R ∪ T = D |R| = r |T | = t To minimize the loss function with respect to W, the most common method is the use of Back-Propagation Algorithm [27]. Back-Propagation Algorithm uses stochastic gradient decent to find a local minimum of a function. At each iteration j of J, Back- Propagation Algorithm updates W Wj+1 = Wj − η ∂E(W) ∂Wj The two sources of randomness in the algorithm are W0 and the order in which training examples are used π. The initial values of element in matrices in W0 are uniform random variable. The ordering of examples is also often a random sample with replacement of J (xi, yi) ∈ R On an intuitive level, the network adjusts W to extract useful features from the image pixel values Xi. In the process it builds the distribution Xi, Yi ∼ PX,Y . In theory the larger the W the more capacity the network has for extracting and leaning features and representing complex distributions. At the same time, it is also more likely to fit noise in the data a nonsalient features such as clouds. This overfitting causes poor generalization and we need a network which can generalize to many driving scenes. There are several regularization techniques to overcome overfitting. These include L1, L2, dropout, and others. However, these will only be effective if we do have the data adequately represent the domain of PX,Y . This domain for driving scenes is huge considering it includes 8
  • 9. images of all the different kinds of roads, vehicles, pedestrians, street signs, traffic lights, intersections, ramps, lane marking, lighting conditions, weather conditions, times of day and positions of the camera. [8] tested a network in a limited subset of these conditions and they used almost half million images for training. 9
  • 10. 2 Datasets for the Driving Task Machine learning for autonomous vehicles has been studied for years. Therefore, several datasets already exist. These datasets come in two types. There are datasets of objects of interest in the driving scenes which include vehicles (cars, vans, trucks), cyclists, pedestrians, traffic lights, lane markings and street signs. Usually, these datasets provide coordinates of bounding boxes around the objects Yi. These are useful for training localization and classification models. The second type of datasets provide distances to lane markings, cars and pedestrians. These are used to train regression models. Here we will give a brief overview of several of these datasets. 2.1 Cars, Pedestrians, and Cyclysis Daimler Pedestrian Segmentation Benchmark Dataset contains 785 images of pedestri- ans in an urban environment captured by a calibrated stereo camera. The groundtruth consists of true pixel shape and disparity map. [11] CBCL StreetScenes Challenge Framework contains 3,547 images of driving scenes cap- tured with bounding boxes for 5,799 cars, 1,449 pedestrians, 209 cyclists, as well as buildings, roads, sidewalks, stores, tree, and the sky. The images have been captured by photographers from street, crosswalk, and sidewalk views. [7] KITTI Object Detection Evaluation 2012 contains 7481 training images and 7518 test images with each image containing several objects. The total number of objects is 80,256, including cars, pedestrians, and cyclists. The groundtruth includes a bounding box for the object as well as an estimate of the orientation in the bird’s eye view. [13] Caltech Pedestrian Detection Benchmark contains 10 hours of driving in an urban environment. The groundtruth contains 350,000 bounding boxes for 2300 unique pedes- trians. [9] There are several datasets for street signs [23], [21], and [15]. However, these datasets have been made in European countries and therefore they contain European signs which are very different from their US counterparts. Luckily [24] is a dataset of 6,610 images containing 47 different US road signs. For each sign the annotation includes sign type, position, size, occluded (yes/no), and on side road (yes/no). 2.2 Lanes KITTI Road/Lane Detection Evaluation 2013 has 289 training and 290 test images of road lanes with groundtruth consisting of pixels map the road area and the lane the 10
  • 11. vehicle is in. The dataset contains images from three environment urban with unmarked lanes, urban with marked lanes and urban with multiple marked lanes. [12] ROMA lane database has 116 images of different roads with groundtruth pixel positions of visible lane markings. The camera calibration specifies the pixel distance to true horizon and conversions between pixel distances and meters. [30] 2.3 Observations on Current Datasets The above datasets are quite limited. First, most of them are small when compared to the half a million images used in [8]. Second, they do not represent many of the driving conditions such as different weather conditions or times of day; the reason for this is that measuring equipment, especially cameras, can only function in certain conditions. Since this tends to be sunny weather, most of these datasets are collected during such times. Additionally, all of these datasets include some amount of manual labeling which is not feasible when the dataset includes millions of images. 2.4 Video Games and Datasets The problems associated with the datasets would be resolved if we could somehow sample from PX,Y both Xi and Yi without having to spend time to measure Yi. This is not possible in the real world. However, [8] decided to use a virtual world, a racing video game called Torcs [6]. The hope behind this approach is that the game can simulate PX,Y well enough so that the network, once trained, will be able to generalize to the real world. Let us assume that this is true. The main benefit of using Torcs and other video games is access to the game engine. This allows us to extract the true Yi for each Xi we harvest from the screen. Torcs itself has several restrictions which limit it from simulating the range of driving conditions present in the real world. Fundamentally it is a racing game with circular, one-way tracks. The weather and lighting conditions are fixed. The textures are rather simple and thus unrealistic. To overcome these limitations and allow for a more diverse and realistic dataset, we focus on the game called Grand Theft Auto 5 (GTA5). Unlike Torcs, the makers of GTA5 had the funds to create a very realistic world since they were developing a commercial product and not an open-source research tool. GTA5 has hundreds of different vehicles, pedestrians, freeways, intersections, traffic signs, traffic lights, rich textures, and many other elements which create a realistic environment. Additionally, GTA5 has about 14 weather conditions and simulates lighting conditions for 24 hours of the day. To tap into these features, the next section examines ways of extracting various data. 11
  • 12. Figure 1: Graphics and roads in Torcs. Figure 2: Graphics and roads in GTA 5. 3 Sampling from GTA 5 3.1 GTA 5 Scripts Development GTA 5 is a closed source game. There is no out-of-the-box access to the underlying game engine. However, due to the game’s popularity, fans have hacked into it and developed a library of functions for interacting with the game engine. This is done by the use of scripts loaded into the game. The objective of this paper is not to give tutorial on coding scripts for GTA 5, and as such we will keep the discussion of code to a minimum. However, we will explain some of the code and game dynamics for the purpose of reproducibility and presentation of the methods used to extract data. Two tools are needed to write scripts for GTA 5. The first tool is ScritHook by Alexan- der Blade. This tool can be downloaded from: https://www.gta5-mods.com/tools/script- hook-v or http://www.dev-c.com/gtav/scripthookv/. It comes with a useful trainer which provides basic control over many game variables including weather and time. The next tool is a library called Script Hook V .Net by Patrick Mours which allows us to use C# and other .Net languages to write scripts for GTA 5. The library can be downloaded from https://www.gta5-mods.com/tools/scripthookv-net. For full source code and list of functions please see https://github.com/crosire/scripthookvdotnet. 3.2 Test Car To make the data collection more realistic we will use an in-game vehicle, the test car, with a mounted camera; similar to [13]. The vehicle model for the test car was picked arbitrarily and can be replaced with any other model. Besides the steering controls, we introduce 3 new functions for the following keys: NumPad0, ”I”, and ”O”. NumPad0 spawns a new instance of our test car. ”I” mounts the rendering camera on the test car. 12
  • 13. Figure 3: Test Vehicle ”O” restores the control of the rendering camera back to the original state. Let us look at the some of the code for the test car. The TestVehicle() function is a constructor for the TestVehicle class. It is called once when all of the scripts are loaded. This occurs at the start of the game and can be triggered at any point in the game by hitting the ”insert” key. This constructor gains control of the camera which is rendering the game by destroying all cameras and creating a new rendering camera. The function responsible for this is World.CreateCamera. The first two arguments represent position and rotation. The last argument is the field of view in degrees. We set it to 50, however this could be changed to fit the parameters of a real world camera. It is important to note GTA.Native.Function.Call. GTA 5’s game engine has thousands of native functions which were used by the developers to build the game. This library encapsulates some of them. Others can be called using GTA.Native.Function.Call where the first argument is the hash code of the function to call and the remaining arguments are the arguments to pass to the native function. One of the biggest challenges in this project is figuring out what these other arguments represent and control. There are 13
  • 14. online databases where players of the game list known functions and parameters. These databases are far from complete. Therefore, for some of these native function calls, some of the arguments may not have any justification besides that they make the function work. This is the price paid for using a closed source game. public TestVehicle() { UI.Notify("Loaded TestVehicle.cs"); // create a new camera World.DestroyAllCameras(); camera = World.CreateCamera(new Vector3(), new Vector3(), 50); camera.IsActive = true; GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true, camera.Handle, true, true); // attach time methods Tick += OnTick; KeyUp += onKeyUp; } The camera position and rotation do not matter in the previous function as they will be dynamically updated to keep up with the position and rotation of the car. This is accomplished by updating both properties at everything tick of the game. A tick is a periodic call of the OnTick function. On each tick, we will keep the camera following the car by setting its rotation and position to be that of the test car. The position of the camera is offset by 2 meters forward and 0.4 meters up relative to the center of the test car. This places the camera on the center of the hood of the car as seen in Figure 4. // Function used to keep camera on vehicle and facing forward on each tick step. public void keepCameraOnVehicle() { if (Game.Player.Character.IsInVehicle()) { // keep the camera in the same position relative to the car camera.AttachTo(Game.Player.Character.CurrentVehicle, new Vector3(0f, 2f, 0.4f)); // rotate the camera to face the same direction as the car camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation; } } 14
  • 15. Figure 4: The red dot represents camera location. void OnTick(object sender, EventArgs e) { keepCameraOnVehicle(); } 3.3 Desired Functions Being inside the game with our test vehicle, we want to collect training data. Existing datasets provide good inspiration for what should be collected. A common datum is the coordinates of bounding boxes for objects such as cars as in [7], [9] and [13] and traffic signs as in [23], [21], [15] and [24]. Pixel maps representing areas in the image where cer- tain objects are also common. ROMA [30] has pixel of lane marking. KITTI Road/Lane Detection Evaluation 2013 [12] has pixel of road areas marked. Daimler Pedestrian Seg- mentation Benchmark Dataset [11] has pixel of pedestrians marked. Lastly, we would like to make measurements of distances to lanes and cars in a framework from [8]. The following sections describe ways of collecting the above information for X, Y data pairs. 15
  • 16. 3.4 Screenshots To collect X, we take a screen shot of the game. GTA 5 runs only on Windows. Using Windows user32.dll functions GetForegroundWindow, GetClientRect, and Client- ToScreen, we can extract the exact area of the screen where the game appears. Neural networks take small, usually 100 pixels by 200 pixels, images as input, we set the game resolution to be as small as possible and let h = IMAGE HEIGHT = 600 pixels and w = IMAGE WIDTH = 800 pixels. These could be furthered scaled down to fit a particular model such as [8]. For implementation please see Appendix A. 3.5 Bounding Boxes A bounding box is a pair of points which defines a rectangle which encompasses an object in an image. Let b = {(xmin, ymin), (xmax, ymax)} be a bounding box, where xmin, ymin, xmax, ymax are coordinates in an image in pixels with the upper left corner being the origin. The task of creating bounding boxes includes computing the extremes of a 3 dimensional object and enclosing them in a rectangle. The algorithm for doing this is very simple. Algorithm 1 Algorithm for computing a bounding box. Require: Model m and center c of an object 1: get the dimensions of m → (h, w, d) 2: compute unit vectors with respect to the object (ex, ey, ez) 3: using ex, ey, ez and h, w, d compute the set of vertices v of a cube enclosing the object 4: map each point p ∈ v to the viewing plane using g : R3 → R3 to create set z 5: xmin = min x z 6: xmax = max x z 7: ymin = min x z 8: ymax = max y z 9: if xmin < 0 then xmin = 0 10: if xmax > IMAGE WIDTH then xmax = IMAGE WIDTH 11: if ymin < 0 then ymin = 0 12: if xmax > IMAGE HEIGHT then ymax = IMAGE HEIGHT In GTA 5 it is very easy to compute ex, ey, ez and get h, w, d for models for cars, pedestrians, and traffic signs. Therefore, it is easy to create a bounding cube around an object. The code excerpt below details the calculation. e is the object we wish to bound and dim is a vector of the dimensions of the model h, w, d. Vector3[] vertices = new Vector3[8]; 16
  • 17. vertices[0] = FUL; vertices[1] = FUL - dim.X*e.RightVector; vertices[2] = FUL - dim.Z*e.UpVector; vertices[3] = FUL - dim.Y*Vector3.Cross(e.UpVector, e.RightVector); vertices[4] = BLR; vertices[5] = BLR + dim.X*e.RightVector; vertices[6] = BLR + dim.Z*e.UpVector; vertices[7] = BLR + dim.Y*Vector3.Cross(e.UpVector, e.RightVector); There is a function called WorldToScreen which takes a 3 dimensional point in the world and computes that points location on the screen. Unfortunately, this function returns the origin if a point is not visible on the screen. This is a problem as we want to draw a bounding box even if part of the object is out of view, a car coming in on the left for example. In these cases we want the bounding box to extend to the edge of the screen. The simplest solution is to map all points to the viewing plane which is infinite and follow the algorithm above. This requires a custom g function and a good understanding of the camera model. 3.5.1 GTA 5 Camera Model Let’s first establish some terminology. Let e ∈ R3 be the location of the observer and let c ∈ R3 be a point on the viewing plane, the plane where the image of the world is formed, such that vector p from e to c represents the direction the camera is pointing and is perpendicular to the viewing plane. Additionally, let θ be a rotation vector of the camera relative to the world coordinates. After a lot of experimentation, we determined that the position property of the camera object in GTA 5 refers to e. θ measures angles counterclockwise in degrees. When θ = 0, the camera is facing down the positive y-axis and the view plane is thus the xz-plane. The order of rotation from this position is around x-axis then y-axis and then z-axis. 3.5.2 From 3D to 2D Based on the information about the camera model, we can take a 3 dimensional point in the world and then map it to the viewing plane and then transform it to screen pixels. Let a ∈ R3 be the point we wish to map. First we must transform this point to the camera coordinates. This is accomplished by rotating a using the equations below and subtracting c, the subtraction is omitted. 17
  • 18. Figure 5: Camera model and parameters in GTA 5   dx dy dz   =   cos(θx) −sin(θx) 0 sin(θx) cos(θx) 0 0 0 1     cos(θy) 0 sin(θy) 0 1 0 −sin(θy) 0 cos(θy)     1 0 0 0 cos(θx) −sin(θx) 0 sin(θx) cos(θx)     ax ay az   dx = cos(θz)[axcos(θy) + sin(θy)[aysin(θx) + azcos(θx)]] − sin(θz)[aycos(θx) − azsin(θx)] dy = sin(θz)[axcos(θy) + sin(θy)[aysin(θx) + azcos(θx)]] + cos(θz)[aycos(θx) − azsin(θx)] dz = −axsin(θy) + cos(θy)[aysin(θx) + azcos(θx)] We also need to rotate the vector representing the z direction in the world, vup,world and the vector representing the x direction in the world, vx,world. We also need to compute the width and hight of the region of the view plane which is actually displayed on screen. We call this region the view window. In the equations below F is the field of view in radians and dnear clip is the distance between c and e. viewWindowHeight = 2 ∗ dnear cliptan(F/2) viewWindowWidth = IMAGE WIDTH IMAGE HEIGHT ∗ viewWindowHeight 18
  • 19. We then compute the intersection point between vector d − e and the viewing plane, call it pplane. We translate the origin to the upper left corner of the view window and update pplane to pplane. newOrigin = c + viewWindowHeight 2 ∗ vup,camera − viewWindowWidth 2 ∗ vx,camera pplane = (pplane + c) − newOrigin Next we calculate the coordinates of pplane in the two dimensions of the plane. viewPlaneX = p T planevx,camera vT x,cameravx,camera viewPlaneZ = p T planevup,camera vT up,cameravup,camera Finally we scale the coordinates to the size of the screen. UI.WIDTH and UI.HEIGHT are in-game constants. screenX = viewPlaneX viewWindowWidth ∗ UI.WIDTH screenY = −viewPlaneZ viewWindowHeight ∗ UI.HEIGHT The process is summarized below. Algorithm 2 get2Dfrom3D: Algorithm for computing screen coordinates of a 3D point. Require: a 1: translate and rotate a into camera coordinates point d 2: rotate vup,world, vx,world to vup,camera, vx,camera 3: compute viewWindowHeight, viewWindowWidth 4: find intersection of d − e with the viewing plane 5: translate origin of the viewing plane 6: calculate the coordinates of the intersection point in the plane 7: scale the coordinates to screen size in pixels 19
  • 20. 3.5.3 General Approach To Annotation of Objects The main objective is to draw bounding boxes around objects which are within a certain distance. There exist functions GetNearbyVehicles, GetNearbyPeds, and Get- NearbyEntities. These functions allows us to get an array of all cars, pedestrians and objects in an area around the test car. Each object can be tested individually to see if it is visible on the screen. We created a custom function for doing so as the in game function has unreliable behavior. This function works by checking if it is possible to draw a strait line between e and at least one of the vertices of the bounding cube without hitting any other object. The name of this methods is ray casting and it will be discussed in more detail later. It must be noted that in the hierarchy of the game, pedestrians and vehicles are also entities. Therefore a filtering process is applied when bounding signs. This process is discussed in the signs section. 3.5.4 Cars Compared to TORCS, GTA 5 has almost ten time more car models. There are 259 vehicles in GTA V (See http://www.ign.com/wikis/gta-5/Vehicles for the complete list). There vehicles are of various shapes and sizes, from golf carts to truck and trailers. This diversity is more representative of the real distribution of vehicles and can hopefully be utilized to train more accurate neural networks. The above method can put a bounding box around any of these vehicles. Please see Figures 6, 7, and 8 for examples. 3.5.5 Pedestrians Pedestrians can also be bounded for classification and localization training. GTA 5 has pedestrians of various genders and ethnicities. More importantly, the pedestrians in GTA 5 perform various actions like standing, crossing streets, sitting etc. This creates a lot of diversity for training. The draw back of GTA 5 is that all pedestrians are about the same height. 3.5.6 Signs As mentioned before, signs are a bit more tricky to bound. There are two reasons for this. First, the only way to find get signs which are around the test vehicle is to get all entities. This includes cars, pedestrians, and various miscellaneous props, many of which 20
  • 21. Figure 6: Two cars bounded in boxes. Weather: rain. Figure 7: Two cars bounded in boxes. 21
  • 22. Figure 8: Traffic jam bounded in boxes. Figure 9: Pedestrians bounded in boxes. 22
  • 23. Sign Description DOT Id [3] GTA Picture Stop Sign R1-1 Yield Sign R1-2 One Way Sign R6-1 No U-Turn Sign R3-4 Freeway Entrance D13-3 Do Not Enter Wrong Way Sign R5-1 and R5-1a Figure 10: Some of the traffic signs present in GTA 5. are of no interest. Thus we need to check each entity for its model to see if it is a traffic sign. To do so, we need a list of all of the models of all traffic signs in GTA 5. This list would include many of the signs listed in Uniform Traffic Control Devices [3]. See Figure 10 for some of the signs in GTA 5. The second difficulty with traffic signs is that they may require more than one bounding box. For example, a traffic light may have several lights on it, see figure 12. This leads to the idea of spaces of interest, or SOP. One sign model may have several space of interest we wish to bound. 23
  • 24. Figure 11: Stop sign in bounding box. Figure 12: Traffic lights in bounding boxes. 24
  • 25. There is an elegant solution to both problems. The solution is a database of spaces of interest. Every entry contains a model hash code, name of the sign, and the x,y,z coordinates of the front upper left and back lower right vertices of the bounding cube. Which such a database, the algorithm for bounding sign is as follows: Algorithm 3 Algorithm for bounding signs. Require: d - database of spaces of interest 1: read in d 2: get array of entities e from GetNearbyEntities 3: for each entity in e do 4: check if the model of the entity matches any hash codes in d 5: get all the matching spaces of interest 6: for each space of interest do 7: draw a bounding box 3.6 Pixel Maps Pixel maps are more refined bounding boxes. Instead of marking an entity with four pixels, we mark it with every pixel it occupies on the screen. This can be done easily when we start with a bounding box b = {(xmin, ymin), (xmax, ymax)} and invert the function which maps 3 dimensional point to the screen. The inverse of g can be constructed as follows. Given a screenX and screenY in pixels, we transform the pixel values to coordinates on the viewing plane. Next, we transform the point on the viewing plane into a point in the 3 dimensional world, pworld. viewPlaneX = screenX UI.WIDTH ∗ viewWindowWidth viewPlaneZ = −screenY UI.HEIGHT ∗ viewWindowHeight pworld = viewPlaneX ∗ vx,camera + viewPlaneZ ∗ vup,camera + newOrigin Once we compute pworld, we use Raycast function to get the entity which occupies that pixel. They Raycast function requires a point of origin, in our case e, a direction, in our case pworld −e and a maximum distance the ray should travel, which we could set to be a very large number like 10,000. If the entity returned by Raycast matches the entity the bounding box encloses, then we added the pixel to the map. 25
  • 26. Algorithm 4 Algorithm for computing a pixel map of an entity. Require: entity, b = {(xmin, ymin), (xmax, ymax)} 1: let map be a boolean array IMAGE WIDTH by IMAGE HEIGHT 2: for x ∈ {xi|xi ∈ Z, xmin ≤ xi ≤ xmax} do 3: for y ∈ {yi|yi ∈ Z, ymin ≤ yi ≤ ymax} do 4: compute pworld of x, y 5: Raycast from e in direction of pworld − e to get entityRaycast 6: if entity = entityRaycast then 7: set map[x, y] to true Depending on the application, these maps can be combined together using the OR boolean function. The function for pixel maps is yet to be implemented due to time constraints. Besides being a trivial extension of bounding boxes, it is also less useful for machine learning due to a cumbersome and perhaps unnecessarily complex representation of objects. Figures 13 and 14 show what the result of such a function would look like. 3.7 Road Lanes Identifying and locating cars, pedestrians, and signs will only help with a part of the driving task. Even without any of these things present, drivers must still stay within a specified lane. Ultimately, locating the lanes and the vehicle’s position in them is the foundation of the driving task. We will explore a method for extracting information similar to [8] from GTA 5. 3.7.1 Notes on Drivers First, let’s examine how real drivers collect information on lane positions. There is ample literature on the topic. The general consensus is that humans look ahead about 1 second to locate lanes. [10] [20] [19] This time applies for speeds between 30 kmh and 60 kmh [10] [19] and corresponds to a distance of about 10 meters. In a more detailed model, human drivers have 2 distances at which they collect information. At 0.93 s or 15.7 m road curvature information is collected [19] and at 0.53 or 9 m position in lane is collected [19]. Near information is used to fine tune driving and is sufficient at low speeds [19]. At high speeds, the further level is used for guidance and stabilization [10]. Divers also look about 5.5 degrees below the true horizon for road data. [19] For curves, humans use a tangent point on the inside of the curve for guidance [20]. They locate this point 1 to 2 seconds before entering the curve. [20] 26
  • 27. Figure 13: Image with a bounding box. 27
  • 28. Figure 14: Image with a pixel map for a car applied. 28
  • 29. 3.7.2 Indicators From literature on human cognition, we know where people look for information on road lanes. In [8], we find a very useful model on what information to collect. Chenyi et al. system uses 13 indicators for navigating down a highway like racetrack. While this roadway is very simple compared to real world road which have exits, entrances, shared left turn lanes, and lane merges, the indicators are quite universal. Figure 15 lists the indicators, their descriptions, and ranges. 3.7.3 Road Network in GTA 5 The GTA 5 road network is composed of 74,530 nodes and 77,934 links. [2] For each node there are x, y, z coordinates and 19 flags and each link consists of 2 node ids and 4 flags. [2] This information is contained in paths.ipl. Figures 16 and 17 show which flags are currently known. It does not appear that any of these flags would be particularly useful to figuring out the location of the lane markings. The Federal Highway Administration sets lane width for freeway lane at 3.6 m (12 feet) and for local roads between 2.7 m and 3.6 m. Ramps are between 3.6 and 9 m (12 to 30 feet). [1]. Based on measurements, the lanes in GTA 5 are 5.6 meters wide. This should not be a problem when the network is applied to real world applications since the output can always be scaled. 3.7.4 Finding the Lanes We know what information we would like to collect and we know that we want to collect it at a point in the road about 10 meters in front of the test car. Figure 18 represents our data collection situation. We want to compute where the lanes are at blue line. Assuming we could locate the markings for the left, right and middle lanes, we could then see if there are any cars whose positions fall between these points. The cars would also have to be visible on the screen and no further then some maximum distance. Following [8], this distance d could be 70 meters. We can compute the indicators if we know the position of the lanes and the heading of the road. Let h be the heading vector of the road at the 10 meter mark. Let LL, ML, MR, and RR be points on the lane markings where the blue line intersects the lanes. Let f be a point on the ground at the very front of the test vehicle, possibly below the camera. We will perform the calculation for the three lane indicators are the two lane indicator can be filled in with values set based on these indicators. The angle is simply the angle between the test car heading vector, hcar and the road heading vector. 29
Indicators

Indicator       Description                                                Min Value   Max Value
angle           angle between the car's heading and the tangent of the road    -0.5        0.5
dist_L          distance to the preceding car in the left lane                    0         75
dist_R          distance to the preceding car in the right lane                   0         75
toMarking_L     distance to the left lane marking                                -7       -2.5
toMarking_M     distance to the central lane marking                             -2        3.5
toMarking_R     distance to the right lane marking                              2.5          7
dist_LL         distance to the preceding car in the left lane                    0         75
dist_MM         distance to the preceding car in the current lane                 0         75
dist_RR         distance to the preceding car in the right lane                   0         75
toMarking_LL    distance to the left lane marking of the left lane             -9.5         -4
toMarking_ML    distance to the left lane marking of the current lane          -5.5       -0.5
toMarking_MR    distance to the right lane marking of the current lane          0.5        5.5
toMarking_RR    distance to the right lane marking of the right lane              4        9.5

Figure 15: List of indicators, their descriptions and ranges. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
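In a script, the indicators of Figure 15 can be grouped into a single value type so that they can be written out as one record during data collection. The snippet below is only an organizational sketch; the struct and field names are ours and simply mirror the figure.

// Sketch of a container for the 13 indicators of Figure 15.
// Distances in meters, angle in radians, matching the ranges in [8].
struct RoadIndicators
{
    public double Angle;                               // angle
    public double DistL, DistR;                        // dist_L, dist_R
    public double ToMarkingL, ToMarkingM, ToMarkingR;  // toMarking_L/M/R
    public double DistLL, DistMM, DistRR;              // dist_LL, dist_MM, dist_RR
    public double ToMarkingLL, ToMarkingML,
                  ToMarkingMR, ToMarkingRR;            // toMarking_LL/ML/MR/RR
}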
Flag   Meaning
0      0 (primary) or 1 (secondary or tertiary)
1      0 (land), 1 (water)
2      unknown (0 for all nodes)
3      unknown (1 for 65,802 nodes, otherwise 0, 2, or 3)
4      0 (road), 2 (unknown), 10 (pedestrian), 14 (interior), 15 (stop), 16 (stop), 17 (stop), 18 (pedestrian), 19 (restricted)
5      unknown (from 0/15 to 15/15)
6      unknown (0 for 60,111 nodes, 1,141 other values)
7      0 (road) or 1 (highway or interior)
8      0 (primary or secondary) or 1 (tertiary)
9      0 (most nodes) or 1 (some tunnels)
10     unknown (0 for all nodes)
11     0 (default) or 1 (stop - turn right)
12     0 (default) or 1 (stop - go straight)
13     0 (major) or 1 (minor)
14     0 (default) or 1 (stop - turn left)
15     unknown (1 for 10,455 nodes, otherwise 0)
16     unknown (1 for 32 nodes, otherwise 0, on highways)
17     unknown (1 for 62 nodes, otherwise 0, on highways)
18     unknown (1 for 92 nodes, otherwise 0, some turn lanes)

Figure 16: Flags for nodes. [2]

Flag   Meaning
0      unknown (-10, -1 to 8 or 10)
1      unknown (0 to 4 or 6)
2      0 (one-way), 1 (unknown), 2 (unknown), 3 (unknown)
3      0 (unknown), 1 (unknown), 2 (unknown), 3 (unknown), 4 (unknown), 5 (unknown), 8 (lane change), 9 (lane change), 10 (street change), 17 (street change), 18 (unknown), 19 (street change)

Figure 17: Flags for links. [2]

angle = cos^-1( (h · h_car) / (||h|| ||h_car||) )

For toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR, we will assume that the lanes are straight lines. We have a point on each of those lines and a vector indicating the direction in which they are heading. This assumption is crude; however, at the distances we are discussing it should not produce large errors. Additionally, we could adjust the distance at which we sample data based on the road heading. This would not only be more in line with human behavior [10] [20] [19], it would also reduce errors.
Figure 18: Blue line represents where we want to collect data on lane location.
To compute the distance we must project the vector f − LL onto the vector −h and compute the distance between the projected point and f − LL. We will work out the mathematics for the left marking of the left lane, LL.

r = proj_{−h}(f − LL) = ( ((f − LL) · (−h)) / ||−h||^2 ) (−h)

toMarking_LL = ||(f − LL) − r||

To compute dist_LL, dist_MM, and dist_RR, we must first figure out which vehicles are in which lanes. Of all the vehicles returned by GetNearbyVehicles, we can eliminate any whose heading vector forms an angle of more than 90 degrees with the heading of the road. The position of the vehicle, p, must be within a rectangular prism formed by LL, RR, f, and f + d·h in the direction normal to the ground, which is also the world up vector for the test car, v_up. This can be computed by projecting these points onto the plane through f with normal v_up. The following are the projections of the points.

r_LL = LL − proj_{v_up}(LL) = LL − ( (LL · v_up) / ||v_up||^2 ) v_up

r_RR = RR − proj_{v_up}(RR) = RR − ( (RR · v_up) / ||v_up||^2 ) v_up

r_{f+d·h} = (f + d·h) − proj_{v_up}(f + d·h) = (f + d·h) − ( ((f + d·h) · v_up) / ||v_up||^2 ) v_up

r_p = p − proj_{v_up}(p) = p − ( (p · v_up) / ||v_up||^2 ) v_up

Now we just have to check that the y coordinate of r_p is between the y coordinates of r_LL and r_RR, and that the x coordinate of r_p is between 0 and the x coordinate of r_{f+d·h}. If the vehicle satisfies these bounds, we can compute its distance to all lane markings in the same way we did for the test vehicle. We then check which marking it is closest to and assign it to that lane, or perform additional logic. Let us assume it is in the left lane. We perform the following to compute dist_LL.

r = proj_{h_car}(p − f) = ( ((p − f) · h_car) / ||h_car||^2 ) h_car

dist_LL = ||r||
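These projections translate directly into a few small helpers over the Vector3 type available to scripts. The sketch below assumes that h, h_car, f, and the marking points have already been obtained; Project, HeadingAngle, and ToMarking are our own helper names, and the Vector3 operations used (Dot, Length, scalar multiplication) are assumed to behave as in the library's math types.

// Sketch of the projection-based computations of this section.
// Assumes: using System; using GTA.Math;
static Vector3 Project(Vector3 a, Vector3 onto)
{
    // proj_onto(a) = ((a · onto) / ||onto||^2) onto
    return onto * (Vector3.Dot(a, onto) / Vector3.Dot(onto, onto));
}

static double HeadingAngle(Vector3 h, Vector3 hCar)
{
    // angle = cos^-1( (h · h_car) / (||h|| ||h_car||) )
    return Math.Acos(Vector3.Dot(h, hCar) / (h.Length() * hCar.Length()));
}

static double ToMarking(Vector3 f, Vector3 marking, Vector3 h)
{
    // r = proj_{-h}(f - marking), toMarking = ||(f - marking) - r||
    Vector3 d = f - marking;
    Vector3 r = Project(d, -h);
    return (d - r).Length();
}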
Algorithm 5 Algorithm for computing dist_LL, dist_MM, and dist_RR.
Require: f, h, h_car, LL, ML, MR, RR, d
1: create arrays dist_LLs, dist_MMs, and dist_RRs and add d to each
2: let l be the lane of the vehicle
3: for each vehicle v (with position p and heading h_v) returned by GetNearbyVehicles do
4:   if cos^-1( (h · h_v) / (||h|| ||h_v||) ) < π/2 then
5:     if p is within the three lanes, in front of the test car, and no farther than d then
6:       compute toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR for p
7:       if toMarking_LL is smallest then
8:         l = left lane
9:       else if toMarking_RR is smallest then
10:        l = right lane
11:      else if toMarking_ML is smallest then
12:        if toMarking_LL < toMarking_MR then l = left lane else l = middle lane
13:      else if toMarking_MR is smallest then
14:        if toMarking_RR < toMarking_ML then l = right lane else l = middle lane
15:      if l = right lane then
16:        add ||proj_{h_car}(p − f)|| to dist_RRs
17:      else if l = left lane then
18:        add ||proj_{h_car}(p − f)|| to dist_LLs
19:      else
20:        add ||proj_{h_car}(p − f)|| to dist_MMs
21: dist_RR = min dist_RRs
22: dist_LL = min dist_LLs
23: dist_MM = min dist_MMs
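A script-level sketch of Algorithm 5 follows. It is illustrative only: the lane-corridor test and the marking comparisons are hidden behind the hypothetical helpers InsideLaneCorridor and AssignLane, the Lane enum is our own, and World.GetNearbyVehicles and Entity.ForwardVector are assumed to behave as described for the scripting library.

// Sketch of Algorithm 5. Assumes: using System; using GTA; using GTA.Math;
// h, hCar, f, d, and the marking points are taken as given; Project() is the
// helper sketched earlier. Hypothetical helpers: InsideLaneCorridor() implements
// the prism test above, AssignLane() the toMarking comparisons, and
// enum Lane { Left, Middle, Right } labels the result.
double distLL = d, distMM = d, distRR = d;

foreach (Vehicle v in World.GetNearbyVehicles(testCar.Position, (float)d))
{
    // discard vehicles heading more than 90 degrees away from the road heading
    if (Vector3.Dot(h, v.ForwardVector) <= 0f)
        continue;

    Vector3 p = v.Position;
    if (!InsideLaneCorridor(p))
        continue;

    double dist = Project(p - f, hCar).Length();
    Lane lane = AssignLane(p);
    if (lane == Lane.Left) distLL = Math.Min(distLL, dist);
    else if (lane == Lane.Right) distRR = Math.Min(distRR, dist);
    else distMM = Math.Min(distMM, dist);
}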
To perform the above computation we need a vector representing the heading of the road and a point on each lane marking. This is where the challenge begins. We cannot use any of the functions or methods discussed for objects, because roads and lane markings are not entities. The road is part of the terrain and the lanes are a texture. Therefore, we cannot get the width of the road model or the position of a lane marking the way we obtained those properties for cars.

GTA 5 has realistic traffic. There are many AI-driven cars in the game which navigate the road network while staying in their lanes. Therefore, the game engine knows the location of the lane markings. There are several functions which pertain to roads. GetStreetName returns the name of the street at a specified point in the world. IS_POINT_ON_ROAD is a native function which checks whether a point is on a road. There are also several functions which deal with vehicle nodes.

Vehicle nodes appear to be the primary way the graph of the road network is represented in the game. Every vehicle node is a point at the center of the road, as seen in Figure 19. The nodes are spaced in proportion to the curvature of the road: close together at sharp corners and farther apart on straight stretches of road. Each node has a unique id. The main functions for working with nodes are GET_NTH_CLOSEST_VEHICLE_NODE and GET_NTH_CLOSEST_VEHICLE_NODE_ID. A way to call them in a script is shown in the code snippet below. In this snippet, the "safe" arguments serve an unknown purpose, as do the two zeros in GET_NTH_CLOSEST_VEHICLE_NODE_ID. The i variable specifies which node, in order of proximity, should be selected. There is also a function GET_VEHICLE_NODE_PROPERTIES; however, we could not find a way to get this function to work.

OutputArgument safe1 = new OutputArgument();
OutputArgument safe2 = new OutputArgument();
OutputArgument safe3 = new OutputArgument();
Vector3 midNode;
OutputArgument outPosArg = new OutputArgument();

Function.Call(Hash.GET_NTH_CLOSEST_VEHICLE_NODE, playerPos.X, playerPos.Y,
    playerPos.Z, i, outPosArg, safe1, safe2, safe3);
midNode = outPosArg.GetResult<Vector3>();

int nodeId = Function.Call<int>(Hash.GET_NTH_CLOSEST_VEHICLE_NODE_ID, playerPos.X,
    playerPos.Y, playerPos.Z, i, safe1, 0f, 0f);
Figure 19: Red markers represent locations of vehicle nodes.
The benefit of this system is that we can locate our car on the network by getting the closest node. Given the road heading and lane width, it is possible to compute the centers of the lanes, as seen in Figure 20. The problem is that, as far as we could find, there is no way of getting the heading of the road or the number and positions of the lanes around the node.

Figure 20: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.

A promising approach to solving this problem was road model fitting. We know that the node is at the center of the road. We do not know whether it is on a lane marking or in the middle of a lane. We could assume that it is on a lane marking and then count the number of lanes on the left and right. This could be done by moving over one lane width at a time and checking whether the point is still on the road using IS_POINT_ON_ROAD and GetStreetName. We can repeat the same method under the assumption that the node is in the middle of a lane.
Whichever assumption finds more lanes is the correct one, since the wrong assumption will not count the outermost lanes. This still leaves the question of finding the heading of the road and whether the node lies between lanes going in opposite directions. However, there are two fundamental problems with this approach which make it useless. First, the approach assumes that the nodes are at the centers of lanes or on lane markings. Upon further exploration, we found that nodes can be on medians, as in Figure 21. This is still the center of the road, just not where we expect it. Second, IS_POINT_ON_ROAD is not a reliable indicator of whether a point is actually on a road. Sometimes it returns false for points which are clearly on the road, and sometimes it returns true for points which are on the side of the road.

Figure 21: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.
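For completeness, the lane-counting probe described above would look roughly like the following in a script. It is shown only to make the idea concrete; as just discussed, IS_POINT_ON_ROAD proved unreliable, and the rightOfRoad direction would itself require the unknown road heading. The 5.6 m lane width is the value measured in Section 3.7.3, and CountLanes is our own hypothetical helper.

// Sketch of the (ultimately rejected) lane-counting probe.
// Assumes: using GTA; using GTA.Math; using GTA.Native;
// rightOfRoad is a unit vector perpendicular to the road heading, which is itself unknown.
const float LANE_WIDTH = 5.6f;

int CountLanes(Vector3 node, Vector3 rightOfRoad)
{
    int lanes = 0;
    // step outward one lane width at a time until the probe leaves the road
    for (int i = 1; i <= 8; i++)
    {
        Vector3 probe = node + rightOfRoad * (i * LANE_WIDTH);
        bool onRoad = Function.Call<bool>(Hash.IS_POINT_ON_ROAD,
                                          probe.X, probe.Y, probe.Z, 0);
        if (!onRoad)
            break;
        lanes++;
    }
    return lanes;
}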
There are two solutions to this problem. The first is to keep hacking at the game until we find all of this information. The information we are looking for must be somewhere in the game, because the game AI knows where to drive: it knows where the lanes are and how to stay in them. The second solution is to build a database of nodes. Figure 22 lists the data which would be stored in this database.

Field        Meaning
nodeId       The numerical id of the node.
onMarking    True if the node is on a lane marking, false if it is in the middle of a lane.
oneWay       True if the traffic on both sides of the node moves in the same direction.
leftStart    Vector representing the point where the road begins left of the node.
leftEnd      Vector representing the point where the road ends left of the node.
rightStart   Vector representing the point where the road begins right of the node.
rightEnd     Vector representing the point where the road ends right of the node.
heading      Vector representing the heading of the road.

Figure 22: Node database entry design.

The problem with this method is that there are over 70,000 nodes and there does not appear to be an easy way of collecting this information. At the moment, there does not appear to be a simpler solution.
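A node record matching Figure 22 could be declared as below and serialized to a file as the database is built. The type and field names are ours, chosen only to mirror the figure.

// Sketch of a node database entry mirroring Figure 22.
// Assumes: using GTA.Math;
class NodeRecord
{
    public int NodeId;          // numerical id of the node
    public bool OnMarking;      // true if on a lane marking, false if mid-lane
    public bool OneWay;         // true if both sides carry traffic in the same direction
    public Vector3 LeftStart;   // where the road begins left of the node
    public Vector3 LeftEnd;     // where the road ends left of the node
    public Vector3 RightStart;  // where the road begins right of the node
    public Vector3 RightEnd;    // where the road ends right of the node
    public Vector3 Heading;     // heading of the road at the node
}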
4 Towards The Ultimate AI Machine

The previous section outlined methods for getting information out of GTA 5 to create datasets. To fully utilize GTA 5, we still need to create a database of nodes and spaces of interest. Once that is done, we will move on to creating datasets and training neural networks.

The objective of harvesting this data has been emphasized as training data for neural networks. However, the ultimate goal is much grander: building a system which can master driving in GTA 5. This system would probably include several neural networks and perhaps other statistical models. For example, it may include a network for locating pedestrians, an SVM for classifying street signs, another network for recognizing traffic lights, etc. All of these components would be linked together by some master program that would construct the most likely world model based on all of these "sensors". Another program would then be responsible for driving the car. Since we can extract data from GTA 5 in real time, we can test how well this system would work in changing conditions.

In the process of building such a system, it is possible to test out some new ideas in neural networks. We would like to continue to explore curriculum learning [5] and self-paced learning [18] [16] as means of presenting examples in order of difficulty. Since these ideas have been applied to object tracking in video [28], teaching robots motor skills [17], matrix factorization [31], handwriting recognition [22], and multi-task learning [26], surpassing state-of-the-art benchmarks, we hope that they could be used to improve autonomous driving. Another interesting idea is transfer learning [25], or the ability to use a network trained in one domain in another domain. This could be applied to pedestrian and sign classifiers. Lastly, we have been working on ways to use optimal learning to select the best neural network architectures. It would be interesting to try those methods in this application.

Building this system presents two major difficulties. First, both the game and the neural networks are GPU-intensive processes. Running both on a single machine would require a lot of computational power. Second, GTA 5 will only work on Windows PCs, while most deep learning libraries are Linux based. Porting either application is close to infeasible. Last semester, working with Daniel Stanley and Bill Zhang, we constructed a solution for running GTA 5 with TorcsNet from [8]. The idea was to run the processes on separate machines and have them communicate via a shared folder on a local network; see Figure 23. During the tests, the amount of data transferred was small: a text file of 13 floats and a 280 by 210 png image. This setup is fast enough for the system to run at around 10 Hz.
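The shared-folder exchange can be sketched as follows from the GTA 5 side. Only the payload (a 13-float text file and a 280 by 210 image) comes from the experiment above; the file names, the share path, and the polling interval are illustrative assumptions, and ExchangeOnce is a hypothetical helper.

// Sketch of the GTA 5 side of the shared-folder exchange (names are assumptions).
// Assumes: using System; using System.Globalization; using System.IO; using System.Threading;
const string SharedFolder = @"\\LINUX-BOX\gta_share";   // hypothetical network share

void ExchangeOnce()
{
    // 1. write the current frame for the machine running the network
    string imagePath = Path.Combine(SharedFolder, "frame.bmp");
    screenshot(imagePath);   // function from Appendix A (saves BMP; adjust the format if PNG is required)

    // 2. wait for the 13 indicators written back over the share
    string indicatorPath = Path.Combine(SharedFolder, "indicators.txt");
    while (!File.Exists(indicatorPath))
        Thread.Sleep(10);    // a ~10 Hz budget leaves roughly 100 ms per frame

    string[] tokens = File.ReadAllText(indicatorPath).Split(
        new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    double[] indicators = Array.ConvertAll(tokens,
        t => double.Parse(t, CultureInfo.InvariantCulture));

    // 3. hand the indicators to a controller and clean up for the next cycle
    File.Delete(indicatorPath);
}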
Figure 23: GTA V Experimental Setup

4.1 Future Research Goals

Build a database of GTA V road nodes
Build a database of GTA V road signs
Train sign classifier
Train traffic lights classifier
Compare how well GTA V trained classifier works on real datasets
Check how well the TORCS network can identify cars in GTA V
Build a robust controller in GTA V which uses all 13 indicators
Explore the effects of curriculum learning on driving performance
Explore transfer learning and optimal learning for neural networks
Test trained models in a real vehicle (PAVE)
A Screenshot Function

// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
//           using System.Runtime.InteropServices;

private struct Rect
{
    public int Left;
    public int Top;
    public int Right;
    public int Bottom;
}

[DllImport("user32.dll")]
private static extern IntPtr GetForegroundWindow();

[DllImport("user32.dll")]
private static extern IntPtr GetClientRect(IntPtr hWnd, ref Rect rect);

[DllImport("user32.dll")]
private static extern IntPtr ClientToScreen(IntPtr hWnd, ref Point point);

void screenshot(String filename)
{
    //UI.Notify("Taking screenshot?");
    var foregroundWindowsHandle = GetForegroundWindow();
    var rect = new Rect();
    GetClientRect(foregroundWindowsHandle, ref rect);

    var pTL = new Point();
    var pBR = new Point();
    pTL.X = rect.Left;
    pTL.Y = rect.Top;
    pBR.X = rect.Right;
    pBR.Y = rect.Bottom;
    ClientToScreen(foregroundWindowsHandle, ref pTL);
    ClientToScreen(foregroundWindowsHandle, ref pBR);

    Rectangle bounds = new Rectangle(pTL.X, pTL.Y,
        rect.Right - rect.Left, rect.Bottom - rect.Top);

    using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
    {
        using (Graphics g = Graphics.FromImage(bitmap))
        {
            g.ScaleTransform(.2f, .2f);
            g.CopyFromScreen(new Point(bounds.Left, bounds.Top),
                Point.Empty, bounds.Size);
        }

        Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
        using (Graphics g = Graphics.FromImage(output))
        {
            g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
        }
        output.Save(filename, ImageFormat.Bmp);
    }
}
References

[1] Lane width. http://safety.fhwa.dot.gov/geometric/pubs/mitigationstrategies/chapter3/3_lanewidth.cfm. Accessed: 2016-4-29.
[2] Paths (GTA V). http://gta.wikia.com/wiki/Paths_(GTA_V). Accessed: 2016-4-29.
[3] Federal Highway Administration. Manual on Uniform Traffic Control Devices. 2009.
[4] A. R. Atreya, B. C. Cattle, B. M. Collins, B. Essenburg, G. H. Franken, A. M. Saxe, S. N. Schiffres, and A. L. Kornhauser. Prospect Eleven: Princeton University's entry in the 2005 DARPA Grand Challenge. Journal of Field Robotics, 23(9):745–753, 2006.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[6] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
[7] S. M. Bileschi. StreetScenes: Towards Scene Understanding in Still Images. PhD thesis, Citeseer, 2006.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
[9] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311. IEEE, 2009.
[10] E. Donges. A two-level model of driver steering behavior. Human Factors: The Journal of the Human Factors and Ergonomics Society, 20(6):691–707, 1978.
[11] F. Flohr, D. M. Gavrila, et al. PedCut: an iterative framework for pedestrian segmentation combining shape models and multiple data cues. 2013.
[12] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[14] E. Guizzo. How Google's self-driving car works. IEEE Spectrum Online, October 18, 2011.
[15] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.
[16] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[17] A. Karpathy and M. Van De Panne. Curriculum learning for motor skills. In Advances in Artificial Intelligence, pages 325–330. Springer, 2012.
[18] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1189–1197. Curran Associates, Inc., 2010.
[19] M. Land, J. Horwood, et al. Which parts of the road guide steering? Nature, 377(6547):339–340, 1995.
[20] M. F. Land and D. N. Lee. Where we look when we steer. Nature, 1994.
[21] F. Larsson and M. Felsberg. Using Fourier descriptors and spatial models for traffic sign recognition. In Image Analysis, pages 238–249. Springer, 2011.
[22] J. Louradour and C. Kermorvant. Curriculum learning for handwritten text line recognition. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 56–60. IEEE, 2014.
[23] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition: how far are we from the solution? In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–8. IEEE, 2013.
[24] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. Intelligent Transportation Systems, IEEE Transactions on, 13(4):1484–1497, 2012.
[25] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359, 2010.
[26] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. arXiv preprint arXiv:1412.1353, 2014.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[28] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2379–2386. IEEE, 2013.
[29] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661–692, 2006.
[30] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier. Evaluation of road marking feature extraction. In Proceedings of 11th IEEE Conference on Intelligent Transportation Systems (ITSC'08), pages 174–181, Beijing, China, 2008. http://perso.lcpc.fr/tarel.jean-philippe/publis/itsc08.html.
[31] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.