This document summarizes a research paper that applies game theory to traffic assignment. Specifically, the paper develops a stochastic congestion game model of drivers' adaptive day-to-day route choices. The model treats drivers as individual decision-makers whose payoffs include unknown noise. A simulation using the SUMO software validates the model: players' payoffs converge to a Nash equilibrium almost surely, which indicates that the model captures drivers' learning behavior over time in a transportation network with a traffic management system.
1. Journal of the Eastern Asia Society for Transportation Studies, Vol.11, 2015
488
A New Perspective of Traffic Assignment: A Game Theoretical Approach
Genaro PEQUE, Jr. (a), Toshihiko MIYAGI (b), Fumitaka KURAUCHI (c)
(a, b, c) Department of Civil Engineering, Gifu University, Gifu, 501-1193, Japan
(a) E-mail: gpequejr@gifu-u.ac.jp
(b) E-mail: t_miyagi@gifu-u.ac.jp
(c) E-mail: kurauchi@gifu-u.ac.jp
Abstract: Traditional equilibrium models consider transportation networks with well-defined
link travel time functions and continuous drivers. Recently, researchers have focused on adding the
behavioral dimension lacking in traditional equilibrium models by treating drivers as individual
decision-makers (atomic drivers). However, there is currently no underpinning theory that
supports the shift from macroscopic to microscopic traffic assignment modeling.
In this paper, a game theoretical model which provides this link is presented. We
show that this model describes drivers' adaptive behaviors as they perform day-to-day route
choices. Drivers acquire payoffs with unknown noise for their chosen and alternative routes.
This scenario describes a transportation network with the presence of a Traffic Management
Center (TMC).
Finally, a simulation-based dynamic traffic assignment is carried out to
validate the model using the Simulation of Urban MObility (SUMO) open source software. The
simulation shows that Nash equilibrium can be achieved almost surely.
Keywords: Nash Equilibrium, Multi-agent Model, Stochastic Congestion Game
1. INTRODUCTION
Traditional equilibrium models have been widely used as a modeling tool in traffic assignment.
The governing solution concept in these models is the Wardrop equilibrium. A solution to a
traffic assignment problem is a situation in which travel demand and travel supply are consistent
with each other; traffic equilibria are mathematically described in terms of a fixed point
(Nonoyama and Miyagi, 1982; Miyagi et al., 1991) where the interaction of the travel demand
and travel supply doesn't change the input or the outcome. This equilibrium is described by
either the user equilibrium (UE) or the stochastic user equilibrium (SUE).
A user equilibrium (UE) suggests that the flow on a route in a transportation network is
zero if the route has non-minimal cost (Wardrop, 1952). Hence, a UE is attained when all users
are on the routes with minimal costs. An analyst's interpretation of a UE would be based on the
user's perspective where a user can estimate the current best route in the transportation network.
This would imply that link travel time functions are common knowledge, route choices can be
observed and users calculate their best route choice based on this information (best-response).
However, assuming that users have the ability to calculate the current best route is highly
unrealistic and computationally expensive. An alternative approach is to relax these
assumptions and not require the best (optimal) route but rather to consider a user's "perceived"
best route, caused by a user-specific random utility term, while maintaining the common
knowledge assumption (a user knows the distribution of her random utility as well). The process
requires the distribution of demand onto the routes based on the different route cost perceptions
of each user. Route flows fulfill some distribution and flows are shifted towards the desired
route-choice distribution. The shifting of flows happens in a gradual manner (iteratively) until
some stopping criterion is fulfilled indicating that a fixed point has been reached. A stochastic
user equilibrium (SUE) is then obtained when all users take the route of perceived minimal cost
(Daganzo and Sheffi, 1977).
Recently, researchers have focused on the structure of real travel decisions, identifying it
as a major contributing factor in travel demand. Travel decisions are based on users' reactions
to their interaction with each other, which are not accounted for in traditional equilibrium
models. Implementation of the traditional equilibrium models such as UE and SUE focuses on
a single representative of the population, which means that the users being studied are
homogeneous and thus, behavior is invariable. Naturally, to account for real travel decisions,
different representatives from the demand population are required. This increases the level of
detail of the model which consequently increases the degree of heterogeneity of the
transportation network users. Additionally, the traditional equilibrium model treats a
user/traveler (henceforth, we will refer to a "user" as "traveler" to describe an individual
decision maker representing a single or group of users in the population with specific
characteristics) as a non-atomic particle (infinitely divisible). When the demand model accounts
for a growing number of travelers, the combinatorial nature of all possible choices a single
traveler encounters during a single day, together with the non-atomic particle representation of
each traveler, makes traffic assignment computationally intractable (Nagel and Flotterod, 2012).
To overcome this, a traveler can be interpreted as an atomic particle (a discrete decision
maker or a single agent) representing an individual in the population with a different
characteristic. The demand population can now be represented by multiple decision-makers
(multi-agent model). Flow distributions can then be reinterpreted as choice distributions over
the demand drawn using Monte Carlo techniques which maintains its mathematical
interpretation. With the advancement of computing power, micro-traffic simulators (SUMO,
VISSIM, MATSim, TRANSIMS) are being widely adopted for this purpose. A multi-agent
model typically used in micro-traffic simulation samples travelers with different characteristics
in the population and simulates the travelers' interactions in the network. Traveler interaction
occurs during each iterative traffic assignment simulation until a stopping criterion is met.
Additionally, a traveler's choice distribution is reinterpreted as random draws from her own
choice set (i.e. a route set, a plan set, or an activity chain set). Thus, an iterative solution procedure
in the traditional equilibrium models can be reinterpreted as a day-to-day learning behavioral
loop. An important aspect of traditional equilibrium models is the functional relationship
between link travel times and link flows, which isn't carried over to micro-traffic simulation.
Instead, the cost-flow relationships merely serve as look-up tables (where link travel time
functions are implicitly assumed) rather than as functional relationships. Moreover, the main
advantage of using the traditional equilibrium traffic assignment models is the robustness of their
solution, the Wardrop equilibrium. Therefore, in order to overcome the limitations of the
traditional equilibrium models while preserving their solution concept, there is a need to
reinterpret (rather than change) it. We therefore turn to game theory in modeling traveler behavior,
focusing on the Nash equilibrium solution concept, which consequently implies a
Wardrop equilibrium. For an extensive review of game theory's development and application,
readers are referred to Tadelis (2012) and Fudenberg and Levine (1998).
Miyagi and Peque (2012) proposed a game theoretical model which accounts for the
adaptive behavior of players (travelers) in a transportation network. In addition, the authors
defined three classes of players: (a) partially informed-users (PIU) with anticipated payoffs,
(b) partially informed-users with announced payoffs and (c) naïve users (NU). The classes follow
from whether players' user-specific random utility is known or unknown and whether players'
actions can be observed or cannot be observed, in addition to the user-specific travel time
functions. Although the authors believe their stochastic congestion game model is applicable to
a dynamic traffic assignment setting, so far they have only validated it under a static traffic
assignment setting with PIU with anticipated payoffs and naïve users. In contrast, this paper
focuses on the PIU with announced payoffs and its close relation to transportation networks
with Traffic Management Centers (TMCs) that "nowcast" travel times to all drivers in the
transportation network, who use them in making route choice decisions for the following day
(day-to-day dynamics). This scenario is typical of a transportation network utilizing Intelligent
Transportation Systems (ITS). To further develop this model, we use the Simulation of Urban
MObility (SUMO) software to validate it.
A clear motivation for building on this model is the need to develop comprehensive and
sophisticated traffic simulation procedures that include traffic flow simulation in which drivers'
decisions on route choices are interactively connected to the travel times generated by the traffic
simulation. Moreover, the convergence properties in dynamic route choice behavior based on
microscopic simulation are not yet fully established because travel times of the trips generated
by microscopic traffic simulation are not continuous and the expected values of the travel time
functions are not known in advance.
A similar case to the PIU with announced payoffs has been extensively studied in game
theory (Hart and Mas-Colell, 2000; Marden et al., 2009) and reinforcement learning (Borkar,
2008; Miyagi, 2005). In game theory, this is mostly in the better-reply variety of no-regret
algorithms. Hart and Mas-Colell's (2000) work focused on the convergence to the set of
correlated equilibria using regret-matching, while Marden et al.'s (2009) work strengthened
the guarantees of regret-based learning in weakly acyclic games. They proved convergence to
Nash equilibrium almost surely. However, players' payoffs in these cases are unperturbed (no
additive random utility). On the other hand, reinforcement learning using stochastic
approximation (Robbins and Monro, 1951) was extensively studied by Borkar (2008) and was
applied by Miyagi (2005) to transportation, although under the continuous player assumption.
Reinforcement learning is normally used when players' payoffs are initially unknown and must
be estimated over time due to noisy observations (i.e. corrupted payoffs due to the unobserved
switches in actions by the other players, delay/inaccuracy of the information received, etc.).
We follow the authors who have used this approach (Leslie and Collins, 2003; Leslie and
Collins, 2005; Leslie and Collins, 2006; Cominetti et al., 2010; Chapman et al., 2013), although
they considered naïve users.
Our contribution in this paper is the application of the stochastic congestion game model
with PIU with announced payoffs proposed by the authors (Miyagi and Peque, 2012) to a
simulation-based dynamic traffic assignment. In the simulation, we used a
generalised weakened fictitious play actor-critic algorithm (Leslie and Collins, 2006), proposed
for the naïve user case, in the PIU with announced payoffs case. However, we slightly modified
the temperature (dispersion or logit) parameter updating scheme by using a regret-based
updating scheme (Miyagi and Peque, 2012; Miyagi et al., 2013) wherein players' route choices
improve, based on their regret, as time progresses, which readily justifies the algorithm as
a model of learning. More importantly, our simulation results show that convergence to Nash
equilibrium is achieved almost surely.
The paper progresses as follows. In the next section, we introduce the notations,
definitions and concepts used in game theory and how they are applied to the traffic assignment
problem. In section 3, we introduce the stochastic congestion game model together with the
derivation of some of the updating formulations we use in this paper. In section 4, we introduce
the generalised weakened fictitious play actor-critic learning model and its development.
In section 5, we present the simulation-based dynamic traffic assignment
using the Simulation of Urban MObility (SUMO) software and show that players'
payoffs converge to Nash equilibrium almost surely. In section 6, we present our conclusions.
2. CONGESTION GAMES
In this section, we introduce a game, including its notations and some definitions, describing
the transportation network and its players. We then define the desired outcome of the
corresponding game.
2.1 Notations
Consider a game $\mathcal{G}$ described by the triple,
$$\mathcal{G} = (\mathcal{I}, \{\mathcal{A}^i, u^i\}_{i \in \mathcal{I}}). \qquad (2.1.1)$$
The sets $\mathcal{I} = \{1, \dots, i, \dots, I\}$, where $I = |\mathcal{I}|$, and $\mathcal{A}^i = \{a_1^i, \dots, a_k^i, \dots, a_K^i\}$, where $K = |\mathcal{A}^i|$,
represent the set of players and the set of actions of each player $i$, respectively. We use the
notation $a^{-i} \in \mathcal{A}^{-i}$ to represent the action taken by the opponent(s) of player $i$,
$a^{-i} = (a^1, \dots, a^{i-1}, a^{i+1}, \dots, a^I)$, and the action set of her opponent(s),
$\mathcal{A}^{-i} = \mathcal{A}^1 \times \dots \times \mathcal{A}^{i-1} \times \mathcal{A}^{i+1} \times \dots \times \mathcal{A}^I$.
An action profile is a vector denoted by $a = (a^1, \dots, a^i, \dots, a^I) \in \mathcal{A} = \mathcal{A}^1 \times \dots \times \mathcal{A}^i \times \dots \times \mathcal{A}^I$.
We use the conventional notation $a = (a^i, a^{-i})$ to represent an action
profile to explicitly show an action taken by player $i$ against the actions taken by her
opponent(s), $-i$. In this analysis, these sets are assumed finite, non-empty, non-unitary and
time-invariant. In the game $\mathcal{G}$, each player $i$ represents a driver in the transportation network
choosing among her set of routes, represented by $\mathcal{A}^i$, from her origin to her destination. We
sometimes interchangeably use the terms driver, user, traveler and player. The game $\mathcal{G}$ is
played stage by stage as a repeated game. In a repeated game, each stage $t \in T = \{0, 1, 2, \dots\} \subseteq \mathbb{N}$
ends when all the players have chosen an action $a_t^i$, denoted by $a_t = (a_t^1, \dots, a_t^i, \dots, a_t^I)$.
The payoff of each player $i$ in a one-shot game, $T = \{0\}$, is determined by the function
$u^i: \mathcal{A} \to \mathbb{R}$. When the one-shot game is repeated finitely or infinitely often, $T = \{0, 1, 2, \dots\}$,
each player $i \in \mathcal{I}$ observes a sample $\mathcal{U}_t^i$ which is the player's payoff at stage $t$, expressed as
$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i})$. Each player's action $a_t^i$ at stage $t$ is chosen according to a probability
distribution, $\pi_t^i$, which we will refer to as the strategy of player $i$ at stage $t$. A player's
strategy at stage $t$ relies only on her observations from stages $s \in \{0, 1, 2, \dots, t-1\}$, which
are dependent on the information restrictions assumed.
We define the empirical frequency of an action selected by player $i$ at stage $t$ as,
$$z_t^i(a^i) = \frac{1}{t} \sum_{s=0}^{t-1} \mathbb{1}\{a_s^i = a^i\}, \qquad (2.1.2)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function that takes the value of 1 if the statement in the braces
is true and 0 otherwise.
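As a small illustration of (2.1.2), the sketch below computes the empirical frequency of each action from one player's history of choices. The route labels and the five-day history are hypothetical and are not taken from the paper.

```python
from collections import Counter

def empirical_frequency(history, actions):
    """Share of past stages in which each action was chosen, as in (2.1.2)."""
    t = len(history)
    if t == 0:
        raise ValueError("history must contain at least one stage")
    counts = Counter(history)
    return {a: counts[a] / t for a in actions}

# Hypothetical route choices of one driver over five days.
history = ["r1", "r2", "r1", "r1", "r3"]
freq = empirical_frequency(history, ["r1", "r2", "r3"])
```

The returned dictionary is a probability vector over the action set, which is why the text can later treat the empirical frequencies as an (empirical) mixed-strategy.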
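From the stage payoffs, each player can estimate her action values, denoted by
$$\hat{\mathcal{V}}_t^i(\tilde{a}^i) = \sum_{a^{-i} \in \mathcal{A}^{-i}} \frac{1}{t} \sum_{s=0}^{t-1} \mathbb{1}\{a_s^{-i} = a^{-i}\}\, u^i(\tilde{a}^i, a_s^{-i}) = u^i(\tilde{a}^i, z_t^{-i}), \quad \forall \tilde{a}^i \in \mathcal{A}^i, \qquad (2.1.3)$$
where $z_t^{-i}$ denotes the empirical frequency of the opponents' joint actions.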
An average of the realized payoffs for player $i$ at stage $t$ can then be defined as,
$$\bar{\mathcal{U}}_t^i = \frac{1}{t} \sum_{s=0}^{t-1} u^i(a_s^i, a_s^{-i}) \approx u^i(\pi_t^i, \pi_t^{-i}), \qquad (2.1.4)$$
where $\pi_t^{-i} = (\pi_t^1, \dots, \pi_t^{i-1}, \pi_t^{i+1}, \dots, \pi_t^I)$. For now, let the empirical frequencies,
$z_t^i(a^i)$, $\forall a^i \in \mathcal{A}^i$, of player $i$ denote the (empirical) mixed-strategy,
$\pi_t^i(a^i) = z_t^i(a^i)$, $\forall a^i \in \mathcal{A}^i$, with $\pi_t^i \in \Delta(\mathcal{A}^i)$,
of player $i$ at stage $t$. Consider a discrete-time process where the objective of each player is
to maximize her expected payoff based on her mixed-strategy, denoted by,
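$$\max_{\pi^i} u^i(\pi_t^i, \pi_t^{-i}) = \lim_{t \to \infty} \mathbb{E}_{\pi}\left[\frac{1}{t} \sum_{s=0}^{t-1} \mathcal{U}_s^i\right] = \lim_{t \to \infty} \mathbb{E}_{\pi}\left[\bar{\mathcal{U}}_t^i\right]. \qquad (2.1.5)$$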
A player's strategy $\sigma^i \in \Sigma^i$ is the function $\sigma^i: \hat{\mathcal{V}}_t^i \to \Delta(\mathcal{A}^i)$ which induces the set of
probability distributions or mixed-strategies at each stage, $\{\pi_t^i\}_{t>0}$, and $\Sigma^i$ is the set of all
possible strategies of player $i$. Let $\Sigma = (\Sigma^1, \dots, \Sigma^i, \dots, \Sigma^I)$ be the set of all strategy profiles.
Whenever the mixed-strategies at stage $t$, $\pi_t$, induce the same probability distributions,
$\pi_t^i \in \Delta(\mathcal{A}^i)$, $i \in \mathcal{I}$, in the succeeding stages such that they maximize the players' payoffs
and none of the players can obtain a performance improvement by unilaterally using
another mixed-strategy, the profile is called a mixed-strategy Nash equilibrium. A mixed-strategy Nash
equilibrium is formally defined as follows.
Definition 2.1.1. (Mixed-strategy Nash equilibrium). In the game $\mathcal{G}$, a strategy profile
$\pi_* \in \prod_{i \in \mathcal{I}} \Delta(\mathcal{A}^i)$ is a mixed-strategy Nash equilibrium if, for all $i \in \mathcal{I}$ and for all
$\pi^i \in \Delta(\mathcal{A}^i)$,
$$u^i(\pi_*^i, \pi_*^{-i}) \geq u^i(\pi^i, \pi_*^{-i}). \qquad (2.1.6)$$
When all players assign a probability 1 to only one action, i.e. $\pi^i(a^i) = 1$, and the profile satisfies the
condition above, we get a Nash equilibrium in pure strategies, which we formally define below.
Definition 2.1.2. (Pure-strategy Nash equilibrium). In the game $\mathcal{G}$, an action profile
$a_* \in \mathcal{A}$ is a pure-strategy Nash equilibrium if, for all $i \in \mathcal{I}$ and for all $a^i \in \mathcal{A}^i$,
$$u^i(a_*^i, a_*^{-i}) \geq u^i(a^i, a_*^{-i}). \qquad (2.1.7)$$
Nash equilibrium is one of the central solution concepts of game theory. Therefore, one
of the objectives of learning models is to study the kind of behavioral rules that lead to this
equilibrium as a consequence of the long-run, non-equilibrium process of learning.
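To make Definition 2.1.2 concrete, the following sketch enumerates the pure-strategy Nash equilibria of a small game by checking every unilateral deviation. The two-player, two-route game, in which the travel time on a route equals the number of players using it, is a hypothetical example and not the network studied in the paper.

```python
from itertools import product

def pure_nash_equilibria(action_sets, payoff):
    """Return all action profiles from which no player gains by deviating alone."""
    equilibria = []
    for profile in product(*action_sets):
        stable = True
        for i, actions in enumerate(action_sets):
            current = payoff(i, profile)
            for alt in actions:
                deviated = profile[:i] + (alt,) + profile[i + 1:]
                if payoff(i, deviated) > current:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria

def payoff(i, profile):
    """Hypothetical payoff: minus the number of players on player i's route."""
    load = sum(1 for a in profile if a == profile[i])
    return -load

equilibria = pure_nash_equilibria([(0, 1), (0, 1)], payoff)
```

Exhaustive enumeration is only practical for very small games, which is one reason learning processes that reach an equilibrium without enumerating profiles are of interest.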
2.2 Potential Games and Weakly Acyclic Games
We define the transportation network as a traffic game with atomic flow. A traffic game with
atomic flow was first proposed by Rosenthal (1973) and is known to be equivalent to a
(deterministic) congestion game. A congestion game is a special case of a potential game
(Monderer and Shapley, 1996) where the incentive of all players to change their strategy can
be expressed using a single global function called the potential function, $\phi$. For now, we define
a potential game and its generalizations. A potential game is formally defined as follows.
Definition 2.2.1. (Potential games). A finite $I$-player game with action sets $\{\mathcal{A}^i\}_{i \in \mathcal{I}}$
and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a potential game if, for some potential function $\phi: \mathcal{A} \to \mathbb{R}$,
for all $i \in \mathcal{I}$, for all $a^{-i} \in \mathcal{A}^{-i}$ and for all pairs $(a^i, \tilde{a}^i) \in \mathcal{A}^i \times \mathcal{A}^i$,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) = \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}). \qquad (2.2.1)$$
This means that each player's payoff function is aligned with the potential function.
Additionally, potential games have the finite improvement property (FIP), where any best or
better-response of a player to some action profile increases the potential function and every
best or better-response path leads to a Nash equilibrium. Figure 2.2.1 below shows a
game of three players with two actions each, $\mathcal{A}^i = \{0, 1\}$, and two Nash equilibria (blue nodes).
Each node represents an action profile, i.e. the actions chosen by each player, while each directed
link represents an improvement of a player's payoff. The left panel shows an example of a
potential game, in which every directed link is part of an improvement path.
We define a general type of potential game where the players' payoff function alignment
with the potential function is relaxed. It is defined as follows.
Figure 2.2.1. A potential game (left) and a weakly acyclic game (right)
Definition 2.2.2. (Generalized ordinal potential games). A finite $I$-player game with
action sets $\{\mathcal{A}^i\}_{i \in \mathcal{I}}$ and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a generalized ordinal potential game if,
for some potential function $\phi: \mathcal{A} \to \mathbb{R}$, for all $i \in \mathcal{I}$, for all $a^{-i} \in \mathcal{A}^{-i}$ and for all pairs
$(a^i, \tilde{a}^i) \in \mathcal{A}^i \times \mathcal{A}^i$,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) > 0 \implies \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}) > 0. \qquad (2.2.2)$$
A generalized ordinal potential game also has the FIP.
A less restrictive class of games, which is more general than both the potential and the
generalized ordinal potential game and which we use in this paper, is the weakly acyclic game.
A weakly acyclic game requires only that at least one player's payoff function is aligned with
the potential function. Before defining weakly acyclic games, we first define better-response and
best-response actions and strategies. These are formally defined as follows.
Definition 2.2.3. (Better-response). An action $a^i \in \mathcal{A}^i$ is a better-response of player $i$
to an action profile $(\tilde{a}^i, a^{-i})$ if $u^i(a^i, a^{-i}) > u^i(\tilde{a}^i, a^{-i})$. A mixed-strategy $\pi^i \in \Delta(\mathcal{A}^i)$ is a
better-response of player $i$ to a strategy profile $(\tilde{\pi}^i, \pi^{-i})$ if $u^i(\pi^i, \pi^{-i}) > u^i(\tilde{\pi}^i, \pi^{-i})$.
Definition 2.2.4. (Best-response). An action $a^i \in \mathcal{A}^i$ is a best-response of player $i$ to
an action profile $a^{-i} \in \mathcal{A}^{-i}$ of the other players if $a^i \in \arg\max_{\tilde{a}^i \in \mathcal{A}^i} u^i(\tilde{a}^i, a^{-i})$.
A mixed-strategy $\pi^i \in \Delta(\mathcal{A}^i)$ is a best-response of player $i$ to a mixed-strategy profile
$\pi^{-i} \in \Delta(\mathcal{A}^{-i})$ of the other players if $\pi^i \in \arg\max_{\tilde{\pi}^i \in \Delta(\mathcal{A}^i)} u^i(\tilde{\pi}^i, \pi^{-i})$.
A best-response strategy is normally used when unperturbed payoffs with complete
information are assumed, where greedy algorithms can easily be applied. On the other hand,
perturbed payoffs with incomplete information require a better-response strategy, as it relies
on a player's beliefs (which may not be accurate) about her environment. These beliefs improve over
time, so her response gets closer to or becomes equal to a best-response as she gains experience.
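The distinction between Definitions 2.2.3 and 2.2.4 can be sketched in a few lines. The example below assumes a hypothetical game with four players and three parallel routes, where the cost of a route is its load plus a fixed free-flow time; none of these numbers come from the paper.

```python
FREE_FLOW = {0: 0, 1: 1, 2: 2}  # hypothetical free-flow times per route

def payoff(i, profile):
    """Minus the travel time of player i's route (load plus free-flow time)."""
    route = profile[i]
    load = sum(1 for a in profile if a == route)
    return -(load + FREE_FLOW[route])

def deviate(profile, i, action):
    """Profile in which only player i switches to the given action."""
    return profile[:i] + (action,) + profile[i + 1:]

def better_responses(i, profile, actions):
    """Actions that strictly improve player i's payoff (Definition 2.2.3)."""
    current = payoff(i, profile)
    return [a for a in actions if payoff(i, deviate(profile, i, a)) > current]

def best_responses(i, profile, actions):
    """Actions that maximise player i's payoff (Definition 2.2.4)."""
    values = {a: payoff(i, deviate(profile, i, a)) for a in actions}
    top = max(values.values())
    return [a for a in actions if values[a] == top]

profile = (0, 0, 0, 0)          # everyone starts on route 0
better = better_responses(0, profile, (0, 1, 2))
best = best_responses(0, profile, (0, 1, 2))
```

Here both alternative routes improve on the crowded route 0, so both are better-responses, but only one of them is a best-response.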
We now formally define weakly acyclic games as follows.
Definition 2.2.5. (Weakly acyclic games). A finite $I$-player game with action sets
$\{\mathcal{A}^i\}_{i \in \mathcal{I}}$ and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a weakly acyclic game if there exists a potential
function, $\phi: \mathcal{A} \to \mathbb{R}$, with the following property: for any action profile $(\tilde{a}^i, a^{-i})$ that is not a Nash
equilibrium, there exists a player $i \in \mathcal{I}$ with an action $a^i \in \mathcal{A}^i$ such that,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) > 0 \quad \text{and} \quad \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}) > 0. \qquad (2.2.3)$$
The right panel of Figure 2.2.1 shows a weakly acyclic game. The red directed links
represent a loop where at least one of the players' payoff functions is aligned with the potential
function. Weakly acyclic games are generalizations of the Cournot adjustment process of two
firms (i.e. players). The Cournot adjustment assumes that in each period one firm chooses a
pure strategy that is a best-response to the strategy of the other firm from the previous stage.
Weakly acyclic games don't necessarily have the finite improvement property, as shown
above. They were originally defined for better-responses but have recently also been defined for
best-responses (Fabrikant et al., 2013).
2.3 Flows and Costs
We begin with the flow conservation equations in traffic games with atomic flow. For simplicity,
we restrict our analysis to a transportation network with a single origin-destination (OD) pair
connected by a set of paths, $\mathcal{K} = \{1, \dots, k, \dots, K\}$, each made up of a subset of links, $\ell \in \mathcal{L}$. We
assume that for all players in the transportation network, the set of available paths is the same
and is defined to be the players' action sets, i.e. $\{a_1^i(1), \dots, a_k^i(k), \dots, a_K^i(K)\} \equiv \{1, \dots, k, \dots, K\}$,
$\forall i \in \mathcal{I}$. To avoid confusion, we drop the path index $k$ in the notation $a_k^i(k)$,
which means that we use $a^i$ and $k$ interchangeably to denote an action or a path selected by
player $i$. Path flows are denoted by a $K$-dimensional vector $h = (h(1), \dots, h(k), \dots, h(K))$
where each element represents the number of players who chose the path $k$, $h(k) = |\{i : a^i = k\}|$.
Hence, $\sum_{k \in \mathcal{K}} h(k) = I$.
A visit to path $k$ by player $i$ at stage $t$ is expressed as,
$$y_t^i(k) = \mathbb{1}\{a_t^i = k\}. \qquad (2.3.1)$$
The aggregated path flows at an arbitrary stage $t$ are then defined as follows.
$$\sum_{k \in \mathcal{A}^i} y_t^i(k) = 1, \qquad (2.3.2)$$
$$\sum_{i \in \mathcal{I}} y_t^i(k) = h_t(k). \qquad (2.3.3)$$
Let $\{\delta_\ell(k)\}_{\ell \in \mathcal{L}}$ denote elements in the link-path incidence matrix and $f_\ell$ be the flow on
the link $\ell$. We can then define the link flows as,
$$\sum_{k \in \mathcal{K}} \delta_\ell(k)\, h_t(k) = f_{\ell,t}, \quad \forall \ell \in \mathcal{L}. \qquad (2.3.4)$$
We also use the following notation on link flows,
$$f_\ell^i = \sum_{\ell' \in a^i} \mathbb{1}\{\ell = \ell'\}, \quad \forall i \in \mathcal{I}, \qquad (2.3.5)$$
$$\sum_{i \in \mathcal{I}} f_\ell^i = f_\ell. \qquad (2.3.6)$$
Congestion games are a specific class of games in which players' payoff functions have
a special structure. Let $\mathcal{L} = \{\ell_1, \ell_2, \dots\}$ denote a finite set of links. For each link $\ell \in \mathcal{L}$, there is
an associated congestion or travel time function denoted by,
$$c_\ell: \{0, 1, 2, \dots\} \to \mathbb{R}, \qquad (2.3.7)$$
which reflects the travel time for "using" the link as a function of the number of players using
that link, $\ell$.
The link travel time is given by a real-valued, non-decreasing but not necessarily
continuously differentiable function, $c_\ell(f_\ell)$. The cost of a path $k \in \mathcal{A}^i$ chosen by player $i$ at
stage $t$ is defined as,
$$C^i(k) = \sum_{\ell \in \mathcal{L}} \delta_\ell(k)\,(\gamma^i c_\ell + F_\ell), \qquad (2.3.8)$$
where $\gamma^i$ is the value of time for player $i$ and $F_\ell$ is the fare imposed on the link $\ell$. We define
the payoff that a player receives when she chooses a path $k \in \mathcal{A}^i$ as $u^i(k) = -C^i(k)$. Since a
path flow is dependent on the link flows, which are also dependent on the number of discrete
players, the payoff function is discontinuous.
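The chain from action profile to path flows (2.3.3), link flows (2.3.4) and path costs (2.3.8) can be sketched as follows. The three-link, two-path network, the travel time functions and the fare are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
def path_flows(profile, n_paths):
    """h(k): number of players whose chosen path is k, as in (2.3.3)."""
    h = [0] * n_paths
    for k in profile:
        h[k] += 1
    return h

def link_flows(h, incidence):
    """Link flows from path flows via the link-path incidence matrix, as in (2.3.4)."""
    return [sum(d * hk for d, hk in zip(row, h)) for row in incidence]

def path_cost(k, f, incidence, travel_time, fare, value_of_time=1.0):
    """Cost of path k: value of time times link travel time, plus fare, as in (2.3.8)."""
    return sum(
        incidence[l][k] * (value_of_time * travel_time[l](f[l]) + fare[l])
        for l in range(len(incidence))
    )

# Hypothetical network: path 0 uses links 0 and 2, path 1 uses links 1 and 2.
incidence = [[1, 0],
             [0, 1],
             [1, 1]]
travel_time = [lambda f: 10 + 2 * f, lambda f: 15 + f, lambda f: 5 + f * f]
fare = [0, 0, 1]

profile = (0, 1, 1, 0, 1)                 # path chosen by each of five players
h = path_flows(profile, n_paths=2)
f = link_flows(h, incidence)
costs = [path_cost(k, f, incidence, travel_time, fare) for k in range(2)]
payoffs = [-c for c in costs]
```

Because the flows are integer counts of players, moving a single player changes a link flow by exactly one, which is the source of the discontinuity in the payoff function noted above.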
3. STOCHASTIC CONGESTION GAMES
3.1 Travel Information and Route Choice
Following Selten et al. (2004), Miyagi and Peque (2012) introduced different classes of players
defined by the player's knowledge about the states of the routes on a traffic network: partially
informed-users (PIUs) and naïve users (NUs).
PIUs are further categorized into partially informed-users with announced payoffs and
partially informed-users with anticipated payoffs. Players of the first type are assumed to know
neither the structural form of their payoff functions nor any information about the other players.
However, a Traffic Management Center (TMC) announces to all players, in hindsight,
the observed realized payoffs of the actions taken by all the players in the transportation
network. Additionally, payoffs of alternative actions not taken by the players are also
announced to all players. Therefore, each player can get the realized travel times on all of the
available routes between any O-D pair. On the other hand, for the second type of players, each
player knows the structural form of her own payoff function and is capable of observing the
actions of all the other players at every stage. However, she doesn't know the structural form of
the other players' payoff functions. Each player can estimate the expected payoffs that she
would receive by taking actions different from the action taken at stage $t$ through
exploration, where the actions of the other players are held constant. Furthermore, each player
believes that the other players' action selections are based on empirical frequencies. Naïve users
are more realistic in the sense that the only information available to them is the realized travel
time of the selected route on that day.
We restrict our attention to the equilibrium problem of traffic networks used by PIUs with
announced payoffs in this paper. The assumptions on the PIUs with the announced payoffs
follow the prevailing assumptions used in the traditional route choice models; however, we
assume that the travel time functions (or cost functions) are not common knowledge (it will be
similar if we assume that players' true expected payoffs are the same). Furthermore, the PIU
with announced payoffs can be regarded as a situation where a TMC observes traffic
volumes and average vehicle speeds on each link in the network through sensors installed in
the system, and computes the travel times of all possible paths during a specified time period
for any origin-destination pair in the network.
3.2 The Model
For now we set $\gamma^i = 1$, $\forall i \in \mathcal{I}$. We define the potential function in a (deterministic) congestion
game, which we are trying to minimize, as,
$$\min \phi(h) = \sum_{\ell \in \mathcal{L}} \sum_{m=0}^{f_\ell(h)} c_\ell(m). \qquad (3.2.1)$$
The traffic game was shown by Rosenthal (1973) to have at least one pure-strategy Nash
equilibrium.
Lemma 3.2.1. (Rosenthal, 1973). A game with a strictly increasing cost function with
respect to $f_\ell$ of the form (2.3.8) with a potential function of the form (3.2.1) possesses at least
one pure-strategy Nash equilibrium.
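The role of the potential function (3.2.1) can be checked numerically: when a single player switches routes, the change in her cost equals the change in the potential. The sketch below assumes a hypothetical network of two parallel links (so each path is a single link) with made-up travel time functions. The lower summation limit of 0 follows (3.2.1) as written; it only shifts the potential by a constant and does not affect differences.

```python
def link_loads(profile, n_links):
    """Number of players on each link when each path is a single link."""
    loads = [0] * n_links
    for link in profile:
        loads[link] += 1
    return loads

def potential(profile, travel_time):
    """Potential (3.2.1): sum over links of c_l(m) for m = 0, ..., f_l."""
    loads = link_loads(profile, len(travel_time))
    return sum(travel_time[l](m) for l, f in enumerate(loads) for m in range(f + 1))

def cost(i, profile, travel_time):
    """Travel time experienced by player i on her chosen link."""
    loads = link_loads(profile, len(travel_time))
    return travel_time[profile[i]](loads[profile[i]])

travel_time = [lambda m: 2 * m, lambda m: m + 3]   # hypothetical link functions
before = (0, 0, 0, 1)
after = (0, 0, 1, 1)                               # player 2 moves from link 0 to link 1

cost_change = cost(2, after, travel_time) - cost(2, before, travel_time)
potential_change = potential(after, travel_time) - potential(before, travel_time)
```

Since every cost-reducing unilateral move lowers the potential by the same amount and the set of profiles is finite, such moves cannot continue indefinitely, which is the intuition behind the existence of a pure-strategy equilibrium.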
A (deterministic) congestion game is an exact potential game. The action $a^i$ is a
best-response of player $i$ when,
$$\sum_{\ell \in \tilde{a}^i} c_\ell(f_\ell^{-i} + 1) > \sum_{\ell \in a^i} c_\ell(f_\ell^{-i} + 1), \quad \forall \tilde{a}^i \in \mathcal{A}^i,\ \tilde{a}^i \neq a^i, \qquad (3.2.2)$$
holds, where $f_\ell^{-i}$ is the flow on link $\ell$ excluding player $i$. Inequality (3.2.2) expresses the
exploration process in which each player can compare
the payoffs (or costs) among alternative routes and judge which alternative is the best. This
exploration process might be automatically accomplished by a micro-processor installed in a
traveler's own car. Exploration implies that player $i$ knows the structural form of her payoff
function and hence, the cost function and can search the action values of her alternative actions
provided that the actions of her opponents remain unchanged.
In (deterministic) congestion games, it is assumed that all players have complete and
perfect knowledge of the game. Therefore, the realized payoff is always equal to the expected
payoff, which is the result of the best action (route) chosen by player $i$ at each stage and is
denoted as,
$$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i}). \qquad (3.2.3)$$
Equation (3.2.3) explicitly shows that player $i$ can get information about the other players'
actions, $a^{-i}$. To prevent drivers from making deterministic decisions, a stochastic congestion
game is used. In stochastic congestion games the payoff function is perturbed. It is assumed
that the realized payoff consists of the expected payoff $u^i(a^i, a^{-i})$ and a player-specific
random term $\varepsilon_t^i(a_t^i)$. That is,
$$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i}) + \varepsilon_t^i(a_t^i), \qquad (3.2.4)$$
where $u^i$ is the true expected payoff and $\varepsilon_t^i(a_t^i)$ is a component of the player-specific and
time-dependent noise or private information, $\varepsilon_t^i = (\varepsilon_t^i(1), \dots, \varepsilon_t^i(k), \dots, \varepsilon_t^i(K))$. It should be
noted that the realized payoff is defined when all the actions of players who are participants of
the game are observed. Each player believes that the action selection of the other players is
executed based on their mixed-strategies. Therefore, each player's strategy is formulated as
follows:
$$\tilde{\beta}^i(u^i) = \arg\max_{\pi^i \in \mathrm{int}(\Delta(\mathcal{A}^i))} \left[ \sum_{a^i \in \mathcal{A}^i} \pi^i(a^i)\, u^i(a^i, \pi^{-i}) + \tau^i v^i(\pi^i) \right], \qquad (3.2.5)$$
where $\tau^i > 0$ is a smoothing parameter and the function $v^i(\pi^i)$ is known only to player $i$
and is assumed to be a smooth, strictly differentiably concave function satisfying the boundary
condition that as $\pi^i$ approaches the boundary of the simplex, the slope of $v^i$ becomes infinite.
Fudenberg and Levine (1998) assumed the following entropy function:
$$v^i(\pi^i) = -\sum_{a^i \in \mathcal{A}^i} \pi^i(a^i) \log \pi^i(a^i). \qquad (3.2.6)$$
This formulation generates the so-called smooth best response function:
$$\tilde{\beta}^i(u^i)(a^i) = \frac{\exp\{u^i(a^i, \pi^{-i}) / \tau^i\}}{\sum_{b^i \in \mathcal{A}^i} \exp\{u^i(b^i, \pi^{-i}) / \tau^i\}}, \quad \tilde{\beta}^i(u^i) \in \Delta(\mathcal{A}^i). \qquad (3.2.7)$$
Equation (3.2.7) is a map from payoffs to choice probabilities. It is the standard choice
probability function of the additive random utility model of discrete choice theory
(McFadden, 1974), in which the random utility $\varepsilon^i$ follows the double
exponential distribution. Miyagi (1983) showed a duality relation between the entropy function
and the satisfaction function (or log-sum function) of the logit model using Fenchel's duality
theorem, while Hofbauer and Sandholm (2002) used an analysis based on Legendre
transforms. These imply that the log-sum function is the optimized function of the random
utility model and gives a potential function for a stochastic congestion game. However, the
duality holds if and only if the random utility $\varepsilon_t^i(a_t^i)$ is specified by the
double exponential distribution.
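The relation between the entropy-perturbed problem (3.2.5), the logit probabilities (3.2.7) and the log-sum function can be checked numerically. The following is a minimal sketch, not the authors' implementation; the payoff values and the temperature are hypothetical, and the function names are introduced here for illustration only.

```python
import math

def smooth_best_response(payoffs, tau):
    """Logit probabilities exp(u/tau) / sum_k exp(u_k/tau), as in (3.2.7)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    shift = max(payoffs)  # subtracting the maximum avoids numerical underflow
    weights = [math.exp((u - shift) / tau) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Entropy function v(pi) = -sum pi log pi, as in (3.2.6)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perturbed_payoff(probs, payoffs, tau):
    """Objective of (3.2.5): expected payoff plus tau times the entropy."""
    expected = sum(p * u for p, u in zip(probs, payoffs))
    return expected + tau * entropy(probs)

def log_sum(payoffs, tau):
    """Log-sum (satisfaction) function tau * log sum_k exp(u_k/tau)."""
    shift = max(payoffs)
    return shift + tau * math.log(sum(math.exp((u - shift) / tau) for u in payoffs))

# Hypothetical payoffs: negative travel times (seconds) on three routes.
payoffs = [-240.0, -250.0, -300.0]
tau = 10.0  # hypothetical smoothing parameter
probs = smooth_best_response(payoffs, tau)
```

Evaluated at the logit probabilities, the perturbed objective coincides with the log-sum value, which is the duality mentioned above; lowering `tau` concentrates the probability on the best route.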
In our PIU with announced payoffs specification, other players' actions cannot be
observed and the distribution of the random utility is unknown. Additionally, all stage payoffs
are announced (i.e. the payoffs of both the chosen and the alternative actions), which is denoted by
$$\mathcal{U}_t^i = u^i(a_t^i) + \varepsilon_t^i(a_t^i), \quad \forall a^i \in \mathcal{A}^i,\ \forall t. \qquad (3.2.8)$$
Hence, payoffs must be estimated. A player tries to maximize her utility by choosing an action
using the Boltzmann-Gibbs action selection procedure with a vanishing temperature parameter,
$\tau_t^i$, for her strategy selection given by the equation
$$\tilde{\beta}^i(\mathcal{Q}_t^i) = \frac{\exp\{\mathcal{Q}_t^i(a^i)/\tau_t^i\}}{\sum_{k^i \in \mathcal{A}^i} \exp\{\mathcal{Q}_t^i(k^i)/\tau_t^i\}}, \quad a^i \in \mathcal{A}^i,\ i \in \mathcal{I}, \qquad (3.2.9)$$
and $\mathcal{Q}$-learning, given by equation (3.2.10), for her payoff estimation,
$$\mathcal{Q}_t^i(a^i) = \mathcal{Q}_{t-1}^i(a^i) + \lambda_t \big(\mathcal{U}_t^i(a^i) - \mathcal{Q}_{t-1}^i(a^i)\big), \quad \forall a^i \in \mathcal{A}^i. \qquad (3.2.10)$$
To show the equivalence of equations (3.2.7) and (3.2.9), the condition
$$\big\|\mathcal{Q}_t^i(a^i) - \mathcal{U}_t^i(a^i, \pi^{-i})\big\| \to 0 \ \text{ as } t \to \infty \ \text{a.s.} \qquad (3.2.11)$$
must hold.
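Because all payoffs are announced, the update (3.2.10) can be applied to every action at every stage. The sketch below illustrates this under stated assumptions: the true payoffs, the Gaussian noise, and the learning-rate schedule are hypothetical choices made only for this example, since the paper leaves the noise distribution unknown.

```python
import random

def update_estimates(q_prev, announced, lam):
    """One Q-learning step (3.2.10) applied to every action."""
    return [q + lam * (u - q) for q, u in zip(q_prev, announced)]

def simulate(true_payoffs, noise_sd, iterations, seed=0):
    """Run the update on noisy announced payoffs (Gaussian noise is an assumption)."""
    rng = random.Random(seed)
    q = [0.0] * len(true_payoffs)
    for t in range(1, iterations + 1):
        lam = (1.0 + t) ** -0.6  # hypothetical vanishing learning rate
        announced = [u + rng.gauss(0.0, noise_sd) for u in true_payoffs]
        q = update_estimates(q, announced, lam)
    return q

# Hypothetical true payoffs: negative travel times (seconds) on three routes.
true_payoffs = [-240.0, -250.0, -300.0]
estimates = simulate(true_payoffs, noise_sd=5.0, iterations=5000)
max_gap = max(abs(q - u) for q, u in zip(estimates, true_payoffs))
```

In this toy setting the gap between estimates and true payoffs shrinks as the learning rate vanishes, which is the behaviour that condition (3.2.11) requires.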
4. THE PARTIALLY INFORMED-USER ALGORITHM
In this section, we first introduce the generalised weakened fictitious play process and then
proceed with the actor-critic learning algorithm for the day-to-day route-choice of players in
the stochastic congestion game with partially informed-user with announced payoffs.
4.1 The Generalised Weakened Fictitious Play
The generalised weakened fictitious play (GWFP) actor-critic algorithm was proposed by
Leslie and Collins (2006) for the naïve user case, where they proved that, with probability 1, the
mixed-strategies follow a GWFP process. However, the generalised weakened fictitious play
itself was proposed as an extension of the weakened fictitious play of Van der Genugten (2000),
which is normally considered for games wherein players' actions can be observed and is
therefore closely related to the PIU with anticipated payoffs. Additionally, the GWFP process
considers a vanishing best-response perturbation as a mechanism for speeding up the
convergence of fictitious play, which implies that strategies are also estimated according to a
stochastic approximation process.
Before we formally define a GWFP process, let $b^i(\pi^{-i})$ be the best-response set of
player $i$ to the mixed-strategy $\pi^{-i}$ and let
$$b_\epsilon^i(\pi^{-i}) = \big\{\pi^i \in \Delta(\mathcal{A}^i) : \mathcal{U}_t^i(\pi^i, \pi^{-i}) \ge \mathcal{U}_t^i(b^i(\pi^{-i}), \pi^{-i}) - \epsilon\big\}. \qquad (4.1.1)$$
That is, the set of player $i$'s strategies that perform no more than $\epsilon$ worse than her
best response. The joint $\epsilon$-best-response to the mixed-strategy profile $\pi$ is defined as the set
$$b_\epsilon(\pi) = b_\epsilon^1(\pi^{-1}) \times \dots \times b_\epsilon^i(\pi^{-i}) \times \dots \times b_\epsilon^I(\pi^{-I}). \qquad (4.1.2)$$
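Membership in the set (4.1.1) is straightforward to test once the expected payoff of each pure action against the other players' strategies is known. The following sketch assumes those action payoffs have already been computed; the function names and the numbers are hypothetical.

```python
def expected_payoff(strategy, action_payoffs):
    """Expected payoff of a mixed strategy given the payoff of each pure action."""
    return sum(p * u for p, u in zip(strategy, action_payoffs))

def is_epsilon_best_response(strategy, action_payoffs, epsilon):
    """True if the strategy is at most epsilon worse than the best response.

    The best-response payoff equals the largest pure-action payoff because
    the expected payoff is linear in the mixed strategy.
    """
    best = max(action_payoffs)
    return expected_payoff(strategy, action_payoffs) >= best - epsilon

# Hypothetical expected payoffs of routes 1-3 against the other players.
action_payoffs = [-240.0, -250.0, -300.0]
mixed = [0.5, 0.5, 0.0]
```

Here `mixed` earns an expected payoff of -245, so it belongs to the set for a tolerance of 5 but not for a tolerance of 4.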
Definition 4.1.1. (GWFP process). A generalised weakened fictitious play process is any
process $\{\pi_t\}_{t \ge 1}$ such that
$$\pi_{t+1} \in (1 - \alpha_{t+1})\,\pi_t + \alpha_{t+1}\big(b_{\epsilon_t}(\pi_t) + M_{t+1}\big), \qquad (4.1.3)$$
with $\alpha_t \to 0$ and $\epsilon_t \to 0$ as $t \to \infty$,
$$\sum_{t \ge 1} \alpha_t = \infty,$$
and $\{M_t\}_{t \ge 1}$ a sequence of perturbations such that, for any $T > 0$,
$$\lim_{t \to \infty} \sup_k \Big\{ \Big\| \sum_{s=t}^{k-1} \alpha_{s+1} M_{s+1} \Big\| : \sum_{s=t}^{k-1} \alpha_{s+1} \le T \Big\} = 0.$$
In other words, the current strategies are adapted towards a (possibly perturbed) joint
$\epsilon$-best-response. Leslie and Collins (2006) showed that allowing non-zero $\epsilon_t$, letting
$\alpha_t$ be chosen differently and allowing (certain) perturbations does not affect the convergence
result.
Lemma 4.1.1. (Leslie and Collins, 2006). The set of limit points of a generalised weakened
fictitious play process is a connected internally chain-recurrent set of the best response
differential inclusion.
They subsequently presented the following result.
Lemma 4.1.2. (Leslie and Collins, 2006). Any generalised weakened fictitious play
process will converge to the set of Nash equilibria in potential games.
4.2 The Boltzmann-Gibbs Actor-critic Algorithm
Actor-critic algorithms are normally used in cases where players only obtain payoffs for the
actions they have chosen, which is our definition of the naïve user case. However, due to the
complex nature of the dynamic traffic assignment simulation of PIUs with announced payoffs
that we consider, we use an actor-critic algorithm to estimate both the strategies and payoffs of
each player.
Definition 4.2.1. (Boltzmann-Gibbs actor-critic algorithm). A Boltzmann-Gibbs actor-
critic algorithm is a process $\{\pi_t, \mathcal{Q}_t\}$ such that
$$\begin{cases} \pi_t^i(a^i) = (1 - \alpha_t)\,\pi_{t-1}^i(a^i) + \alpha_t\,\beta_t^i(a^i) \\ \mathcal{Q}_t^i(a^i) = \mathcal{Q}_{t-1}^i(a^i) + \lambda_t\big(\mathcal{U}_t^i(a^i) - \mathcal{Q}_{t-1}^i(a^i)\big) \end{cases}, \quad \forall a^i \in \mathcal{A}^i,\ \forall i \in \mathcal{I}, \qquad (4.2.1)$$
where $\beta_t^i(\mathcal{Q}_t^i)$ is defined by equation (3.2.9) and the temperature parameter is updated
according to the regret-based updating scheme
$$\tau_t^i = \tau_{t-1}^i + \frac{1}{t}\big(\max([s]^+, 0) - \tau_{t-1}^i\big), \quad [s]^+ = \max_{k \in \mathcal{A}^i}\big(\mathcal{Q}_t^i(k) - \bar{\mathcal{U}}_t^i\big) > 0. \qquad (4.2.2)$$
We present our result without proof.
Proposition 4.2.2. Suppose that $\{\pi_t, \mathcal{Q}_t\}$ is a Boltzmann-Gibbs actor-critic process for
which,
1.) $\alpha_t = (C_\alpha + t)^{-\rho_\alpha}$ where $C_\alpha > 0$ and $\rho_\alpha \in\, ]0.5, 1]$, (4.2.3)
2.) $\lambda_t = (C_\lambda + t)^{-\rho_\lambda}$ where $C_\lambda > 0$ and $\rho_\lambda \in\, ]0.5, \rho_\alpha[$, (4.2.4)
3.) $\tau_t^i$ is calculated using equation (4.2.2).
Then with probability 1, the $\pi_t$ follow a generalised weakened fictitious play process.
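One iteration of the process in Definition 4.2.1, using the learning rates of Proposition 4.2.2, can be sketched as follows. This is an illustrative reading rather than the authors' code: the constants, the use of a running average realized payoff for the average-payoff term in (4.2.2), and the greedy fallback when the temperature is zero are all assumptions made for this example.

```python
import math

def alpha(t, c_alpha=1.0, rho_alpha=0.9):
    """Strategy learning rate (4.2.3); the constants are hypothetical."""
    return (c_alpha + t) ** -rho_alpha

def lam(t, c_lam=1.0, rho_lam=0.6):
    """Payoff learning rate (4.2.4); requires 0.5 < rho_lam < rho_alpha."""
    return (c_lam + t) ** -rho_lam

def boltzmann_gibbs(q, tau):
    """Choice probabilities (3.2.9); falls back to a greedy choice if tau is 0."""
    if tau <= 0.0:
        best = max(q)
        winners = [1.0 if v == best else 0.0 for v in q]
        return [w / sum(winners) for w in winners]
    shift = max(q)  # subtracting the maximum keeps exp() well scaled
    weights = [math.exp((v - shift) / tau) for v in q]
    total = sum(weights)
    return [w / total for w in weights]

def actor_critic_step(pi_prev, q_prev, tau_prev, avg_payoff, announced, t):
    """One stage of (4.2.1)-(4.2.2) for a single player."""
    l_t, a_t = lam(t), alpha(t)
    # Critic: update every action value from the announced payoffs.
    q = [qk + l_t * (u - qk) for qk, u in zip(q_prev, announced)]
    # Temperature: running average of the non-negative regret.
    regret = max(max(q) - avg_payoff, 0.0)
    tau = tau_prev + (regret - tau_prev) / t
    # Actor: move the mixed strategy towards the smooth best response.
    beta = boltzmann_gibbs(q, tau)
    pi = [(1.0 - a_t) * p + a_t * b for p, b in zip(pi_prev, beta)]
    return pi, q, tau

# Hypothetical first stage with three routes and announced travel times.
pi, q, tau = actor_critic_step([1/3, 1/3, 1/3], [0.0, 0.0, 0.0], 1.0,
                               -250.0, [-240.0, -250.0, -300.0], t=1)
```

With these exponents the strategy rate decays faster than the payoff rate, so the payoff estimates evolve on the faster timescale.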
The regret-based temperature parameter updating scheme is used since it reduces the
number of exogenous variables in the model. A player's regret is directly connected to her
strategy selection and payoffs, on which an improving action selection policy should
depend. This is more logical than the difference of the maximum and
minimum estimates used by Leslie and Collins (2006), which requires the exogenous variable $\rho_\tau$,
$$\tau_t^i = \frac{\max_{k \in \mathcal{A}^i} \mathcal{Q}_t^i(k) - \min_{k \in \mathcal{A}^i} \mathcal{Q}_t^i(k)}{\rho_\tau \log t}. \qquad (4.2.5)$$
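For comparison, the range-based temperature (4.2.5) can be computed as below. This is a minimal sketch; the estimate values and the value chosen for the exogenous parameter are hypothetical.

```python
import math

def range_based_temperature(q, t, rho_tau):
    """Temperature (4.2.5): spread of the estimates divided by rho_tau * log t."""
    if t < 2:
        raise ValueError("t must be at least 2 so that log(t) is positive")
    return (max(q) - min(q)) / (rho_tau * math.log(t))

# Hypothetical payoff estimates for three routes at iteration 100.
q = [-240.0, -250.0, -300.0]
tau_100 = range_based_temperature(q, t=100, rho_tau=1.0)
```

The result scales inversely with `rho_tau`, which is the exogenous choice that the regret-based scheme avoids.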
Additionally, since we are dealing with PIU with announced payoffs, the action counts in the
payoff learning rates of the players used in Leslie and Collins (2006),
$$\lambda_t = \big(C_\lambda + \#_t^i(a_t^i)\big)^{-\rho_\lambda}, \quad \#_t^i(a^i) = \sum_{s \le t} \mathbb{1}\{a_s^i = a^i\}, \qquad (4.2.6)$$
are replaced by just the iteration count. The action counts acted as a sort of unbiased
estimator (Leslie and Collins, 2005) that corrects for the infrequent updates of the values of
actions with low probabilities. This can be viewed as a player's way of compensating for the
fact that actions played infrequently do not receive updates of their values, so when they are
played, any reward prediction error must have greater influence on the value than if frequent
updates occur. However, in the PIU with announced payoffs scenario, all action values are
updated at each stage; thus, there is no need for an estimator.
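The difference between the count-based rate (4.2.6) and the iteration-based rate (4.2.4) can be seen on a short history of choices. The sketch below uses hypothetical constants and a hypothetical choice history with 0-based route indices.

```python
def iteration_rate(t, c_lam=1.0, rho_lam=0.6):
    """Payoff learning rate driven by the iteration count, as in (4.2.4)."""
    return (c_lam + t) ** -rho_lam

def count_rate(count, c_lam=1.0, rho_lam=0.6):
    """Payoff learning rate driven by how often an action was played (4.2.6)."""
    return (c_lam + count) ** -rho_lam

def action_counts(history, n_actions):
    """Number of times each action appears in the choice history."""
    counts = [0] * n_actions
    for action in history:
        counts[action] += 1
    return counts

# Hypothetical history: route 0 chosen nine times, route 1 once, route 2 never.
history = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
counts = action_counts(history, 3)
```

Under the count-based rule the rarely chosen route keeps a larger step size, whereas under announced payoffs every route is updated at each stage and a single rate suffices.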
Using the results of Singh et al. (2000) and Leslie and Collins (2006), the goal is to show
that $\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$ and $\tau_t^i \to 0$ as $t \to \infty$. We can rewrite equation (4.2.2) as
$$\tau_t^i - \tau_{t-1}^i\Big(1 - \frac{1}{t}\Big) = \frac{\max\big[\max_{k \in \mathcal{A}^i}\big(\mathcal{Q}_t^i(k) - \bar{\mathcal{U}}_t^i\big),\, 0\big]}{t}. \qquad (4.2.7)$$
The term on the right-hand side goes to zero almost surely as $t \to \infty$ if
$\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$, which makes $\tau_t^i \to 0$. So we only need to show that
$\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$ almost surely as $t \to \infty$, which we
show through our simulation result.
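The recursion (4.2.7) can be unrolled to make the behaviour of the temperature explicit. In the following worked step, the shorthand $r_t^i$ for the non-negative regret term is notation introduced here for convenience and does not appear in the paper.

```latex
% r_t^i is shorthand (introduced here) for the non-negative regret term of (4.2.7).
r_t^i = \max\Big[\max_{k \in \mathcal{A}^i}\big(\mathcal{Q}_t^i(k) - \bar{\mathcal{U}}_t^i\big),\, 0\Big],
\qquad
\tau_t^i = \Big(1 - \tfrac{1}{t}\Big)\tau_{t-1}^i + \tfrac{1}{t}\, r_t^i
\;\Longrightarrow\;
t\,\tau_t^i = (t-1)\,\tau_{t-1}^i + r_t^i
\;\Longrightarrow\;
\tau_t^i = \frac{1}{t}\sum_{s=1}^{t} r_s^i .
```

The temperature is therefore the running average of past regrets, independent of its initial value, so it vanishes whenever the regret term itself tends to zero.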
5. SIMULATION RESULTS
We present a simulation of the transportation network shown in figure 5.1. We assume that each
player has the same set of actions (available routes), $\mathcal{A}^i = \{1,2,3\}$, $\forall i \in \mathcal{I}$. We assigned
1000 players to the network, which is composed of a single origin-destination (OD) pair with 3
routes, i.e., $I = 1000$. The flow conservation is described by the equation
$I = \sum_{k \in \mathcal{K}} h(k)$, $\mathcal{K} = \{1,2,3\}$.
Figure 5.1. The test network (links 1 to 7, drawn as 1-lane or 2-lane segments, with routes 1, 2 and 3 marked)
Table 5.1. Link segment settings
Link segment   Length (m)   Maximum allowed speed (m/s)   Number of lanes
1              500          13.89                         2
2              1005         13.89                         1
3              1005         13.89                         2
4              1005         13.89                         2
5              1005         13.89                         1
6              200          13.89                         1
7              500          13.89                         2
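The settings in Table 5.1 determine the free-flow travel time of each link, which is the quantity announced for unused routes. A minimal sketch of that arithmetic follows; the dictionary layout and names are choices made for this example, and route compositions are deliberately not assumed.

```python
# Link lengths in meters, taken from Table 5.1.
LINK_LENGTH_M = {1: 500.0, 2: 1005.0, 3: 1005.0, 4: 1005.0,
                 5: 1005.0, 6: 200.0, 7: 500.0}
SPEED_LIMIT_MPS = 13.89  # maximum allowed speed on every link (about 50 km/h)

def free_flow_time(length_m, speed_mps=SPEED_LIMIT_MPS):
    """Travel time in seconds when driving the whole link at the speed limit."""
    return length_m / speed_mps

free_flow_times = {link: free_flow_time(length)
                   for link, length in LINK_LENGTH_M.items()}
```

For instance, a 500 m link takes roughly 36 seconds and a 1005 m link roughly 72 seconds at the speed limit.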
The simulation-based dynamic traffic assignment is carried out using the Simulation of
Urban MObility (SUMO) software. SUMO is a free and open traffic simulation suite which has
been available since 2001. SUMO allows modelling of intermodal traffic systems including
road vehicles, public transport and pedestrians.
In the simulation, players use equations (4.2.1)-(4.2.2) to update their route choices
and payoff estimates. Each of the 1000 iterations covers 3600 simulation seconds, and
players have a player-specific, Poisson-distributed, dynamic departure time. We assume
that speed, flow and density are collected by sensors positioned along all the links. Travel
times on all routes are announced to all the players in the network; for an unused route,
the free-flow travel time is announced.
The intersection formed by links 4 and 6 is priority-based, with link 4 as the main
priority. This means that vehicles traversing link 6 will wait for a gap in link 4 before they can
enter link 5. This also occurs at the intersection between links 3 and 5, where link 3 is the priority.
The legal speed limit on each link is 13.89 m/s. We assume that all vehicles accelerate at 0.8
m/s² and decelerate at 4.5 m/s². The maximum speed of a vehicle is assumed to be 70 m/s
(achievable speed of the vehicle's engine). Each player has an imperfection coefficient (sigma),
which is a braking probability that we set to 0.05. To ensure variable vehicle speeds at each
time step, we set the speed deviation parameter to 0.1, which results in a speed distribution
where 95% of the vehicles drive between 80% and 120% of the legal speed limit.
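The 95% figure is consistent with a speed factor that is normally distributed around 1 with a standard deviation equal to the speed deviation parameter. The check below is a sketch under that assumption; the normal model and the function names are not stated in the paper.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def share_between(low, high, mean=1.0, sd=0.1):
    """Probability that the speed factor lies between low and high."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)

# Share of vehicles driving between 80% and 120% of the speed limit.
share = share_between(0.8, 1.2)
```

The interval spans two standard deviations on each side of the mean, which covers about 95.4% of the vehicles.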
Figure 5.2. Fundamental diagram of the first iteration
Figure 5.2 shows the relationships between speed, flow and density, which together compose the
fundamental diagram of traffic flow used to predict the capability of a road system or its
behavior when applying inflow regulation or speed limits. The upper-right panel shows the
speed-density relationship with a negative linear slope, which means that as the density increases,
the speed on the link decreases. The fitted line crosses the speed axis at the free-flow speed
and the density axis at the jam density. Thus the speed approaches the free-flow speed as the
density approaches zero, and it reaches zero when the density equals the jam density.
However, link 3 has a positive slope because the road segment is fed by a single-lane link (link
2) and then widens to two lanes, which distributes the incoming vehicles over both lanes
and thus does not cause as much congestion as on the other links in the network.
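The linear speed-density relationship described above corresponds to the classical Greenshields model, in which flow is the product of density and speed. The sketch below illustrates it; the jam density value is hypothetical, since the paper does not report one.

```python
def greenshields_speed(density, free_flow_speed, jam_density):
    """Linear speed-density relation: free-flow speed at 0, zero at jam density."""
    if not 0.0 <= density <= jam_density:
        raise ValueError("density must lie between 0 and the jam density")
    return free_flow_speed * (1.0 - density / jam_density)

def greenshields_flow(density, free_flow_speed, jam_density):
    """Flow (vehicles per second) as density times speed."""
    return density * greenshields_speed(density, free_flow_speed, jam_density)

FREE_FLOW_SPEED = 13.89  # m/s, the legal speed limit used in the simulation
JAM_DENSITY = 0.15       # vehicles per meter per lane, hypothetical value

peak_flow = greenshields_flow(JAM_DENSITY / 2.0, FREE_FLOW_SPEED, JAM_DENSITY)
```

Under this model the flow-density curve is a parabola whose maximum lies at half the jam density, with the free-flow branch to its left and the congested branch to its right.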
The flow-density relationship in the lower-right panel of figure 5.2 follows a triangular-shaped
curve, which is approximated by a parabolic curve. However, this is inverted along the density axis.
Normally, the flow-density graph is represented by two lines: the free-flow velocity branch
(negative slope in the figure) and the congested branch (positive slope in the figure).
The congested branch implies that even though there are more vehicles on the road, the number
of vehicles passing a single point is less than if there were fewer vehicles on the road. Flows on
links 3 and 7 are almost unaffected by the increase in density for two reasons: 1.) route 1,
which link 3 belongs to, is the priority route at the intersection where links 3 and 5 meet,
which means that vehicles using link 3 do not stop to give way to vehicles on link 5, and 2.) the
vehicles come from link 2, a single-lane link, and transfer to link 3, a dual-lane link.
The speed-flow diagram in the upper-left panel of figure 5.2, which consists of the free-flow and
congested branches, is used to determine the speed at which maximum flow occurs. There is
currently no function that approximates it; however, the linear approximations (looking from
left to right) show that the average speed decreases as the average flow decreases, implying that
the links are in the congested branch of the speed-flow diagram. Additionally, the approximations
on links 3 and 7 show that these links are almost at optimum flows for the same reasons stated
above.
Figure 5.3. Fundamental diagram of the last iteration
Comparing the fundamental diagrams of figure 5.2 and figure 5.3 shows clearly that
the players have learned to avoid long travel times. Link 1 is slightly congested because
vehicles can change lanes and are inserted into the network randomly between the two
lanes. This means that when a vehicle that is set to use route 3 is inserted in the upper lane, it
needs to wait for a gap in the lower lane to be able to go to link 4. This random insertion
causes slight congestion on this link. In the speed-flow diagram of figure 5.3, link 5
has a negative slope, which is caused not by congestion on link 5 but by a long waiting time due
to the priorities at the merging links. Link 5 belongs to route 3, which has a lower priority
than link 3, which belongs to route 1. This causes vehicles using link 5 to wait for a gap
in order to move to link 7. Lastly, no vehicle uses link 6, which belongs to route 2, as this route
has a very high travel time. A vehicle using route 2 has to wait for a gap at two intersections
because of the lower link priorities: first at the intersection between links 4 and 6, in which
link 4 has the higher priority, and again at the intersection between links 3 and 5, in which
link 3 has the higher priority.
Figure 5.4 below shows that vehicles immediately realize that route 1 is the best route
choice, since link 3 has priority at the intersection of links 3 and 5. However, the fluctuations
that can be observed in the mean route travel time (middle panel) are caused by vehicles
changing from route 1 to routes 2 or 3. Faster vehicles are limited by the speed of
the vehicles ahead of them in link 2, making the travel time in that specific iteration higher for
this route. Therefore, probabilities for this route decrease in the next iteration.
Figure 5.4. Link and route information
In figure 5.5 below, the strategy and payoff learning parameters, $\alpha$ and $\lambda$, respectively, are
shown to be slowly decreasing to zero as time progresses, as required by our result. The
temperature parameter, $\tau$, appears to be decreasing to zero, which is required to show
convergence to a generalised weakened fictitious play process. More importantly, the figure shows
that the temperature parameter is player-specific, which implies that players are learning and
updating their strategies independently of each other, validating the multi-agent model, and that
as this parameter decreases, the probability of choosing the best action increases. This validates
the actor-critic algorithm as a learning model in which players' strategy selection improves
due to experience.
The top panel of figure 5.6 below shows the averaged route probabilities of the selected
routes of all players. This panel is almost identical to the route counts panel (bottom of figure
5.4) because both represent the choice distributions of the players. The only difference is that
these are the "real" route probabilities of the players for selecting the action (i.e. mixed
strategies), which makes them slightly lower than the route counts. If these were based on
pure-strategies (probability 1), they would match the route counts exactly. The middle
and bottom panels show that as time progresses, the distance between the estimated payoffs
(-239.5752278) and the average payoff (-239.7410205) approaches zero, which is necessary to
show equivalence to the case wherein players can actually observe the other players' actions
and is a requirement for convergence to a generalised weakened fictitious play process.
Significantly, it can be observed that even though the information that the players receive is
not very accurate (mean route travel time shown in the middle panel of figure 5.4), the estimates
of the players' payoffs and strategies still converge. Furthermore, the simulation has been
carried out more than 50 times and we get the same consistent result (approximately -239.5)
within a reasonable number of iterations (i.e. after only 200 iterations with 1000 players).
Figure 5.5. Learning parameters
Figure 5.6. Link and route information
6. CONCLUSION
This paper further developed the stochastic congestion game model proposed by Miyagi and
Peque (2012) by applying it to a simulation-based dynamic traffic assignment with
PIU with announced payoffs. Our motivation is the application of such a model to a
transportation network where a Traffic Management Center (TMC) is present which announces
travel times to all drivers. This scenario is typical in a transportation network where Intelligent
Transportation Systems (ITS) are utilized. Examples of such a scenario are the availability of the
Vehicle Information and Communication System (VICS) technology of Japan or the Traffic
Message Channel (TMC) technology of Europe.
The motivation for proposing the game theoretical model was the lack of behavioral
realism inherent in traditional equilibrium models such as the UE and SUE. This led to the
discretization of demand, where users are treated as individual decision-makers, making it a
multi-agent model. Since Wardrop equilibrium is applicable only to the case where players are
non-atomic, Nash equilibrium is used, which preserves its mathematical interpretation.
The authors (Miyagi and Peque, 2012) have shown that Nash equilibrium can be achieved
in two of the three classes of players they have defined, namely, the PIU with anticipated
payoffs (Miyagi and Peque, 2012) and naïve users (Miyagi et al., 2013). However, the results
were only shown for the static traffic assignment setting in both papers. Hence, to validate their
model, we tackled the case where players are PIU with announced payoffs under a
simulation-based dynamic traffic assignment. Moreover, players have a player-specific,
Poisson-distributed, dynamic departure time, making the problem highly complex. To solve this,
players in the transportation network learn and update their strategy and payoff
estimates using an actor-critic algorithm proposed by Leslie and Collins (2006), which we
slightly modified to fit the scenario. Regardless, we obtain the expected results they have
presented, which consequently validates the efficacy of the game theoretical model. As an
additional consequence, we are able to analyze the evolution of the players' route choices and
behaviors as they learn how to use the transportation network. Finally, the simulation shows that
even when the players receive information with noise, convergence to Nash equilibrium is
achieved almost surely within a reasonable number of iterations.
Although the current simulation results are restrictive, they are significant. They show that
the multi-agent model is capable of including player-specific attributes in the traffic simulation
and the route choice model simultaneously, and that it is likely to have the global convergence
property inherent in the usual traffic environment.
Our next step is to apply our methods to a more sophisticated network (i.e. a larger
network, a network with traffic lights and loop detectors, etc.). Furthermore, we are interested
in extending the simulation to the naïve user setting. The naïve user setting closely resembles
the assumptions currently used in micro-traffic simulation models and is a more realistic and
plausible model of current transportation networks.
ACKNOWLEDGEMENT
This research is supported by MEXT Grants-in-Aid for Scientific Research, No. 26420511, for
the term 2014-2016.
REFERENCES
Borkar, V. (2008) Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge
University Press.
Chapman, A., Leslie, D., Rogers, A., Jennings, N. (2013) Convergent Learning Algorithms for
Unknown Reward Games. SIAM Journal on Control and Optimization 2013 51:4, 3154-3180.
Cominetti, R., Melo, E., Sorin, S. (2010) A payoff-based learning procedure and its application
to traffic games. Games and Economic Behavior, 70, pp.71-83.
Daganzo, C., Sheffi, Y. (1977) On stochastic models of traffic assignment. Transpn. Sci. 11,
253-274.
Fabrikant, A., Jaggard, A., Schapira, M. (2013) On the Structure of Weakly Acyclic Games.
Theory of Computing Systems 53, 107-122.
Fudenberg, D., Levine, D. (1998) The Theory of Learning in Games. The MIT Press, Cambridge,
MA, USA.
Hart, S., Mas-Colell, A. (2000) A simple adaptive procedure leading to a correlated equilibrium.
Econometrica, 68:1127-1150.
Hofbauer, J., Sandholm, W. (2002) On the global convergence of stochastic fictitious play.
Econometrica 70, 2265-2294.
Leslie, D., Collins E. (2003) Convergent multiple-timescales reinforcement learning algorithm
in normal form games. Ann. Appl. Probab., 13, pp. 1231-1251.
Leslie, D., Collins E. (2005) Individual Q-learning in normal form games. SIAM J. Control
Optim, 44(2), pp. 495-514.
Leslie, D., Collins E. (2006) Generalised weakened fictitious play. Games and Economic
Behavior, 56:285–298.
Marden, J., Young, P., Arslan, G., Shamma, J. (2009) Payoff-based dynamics for multiplayer
weakly acyclic games. SIAM J. Control and Optimization, 48(1).
McFadden, D. (1974) Conditional logit analysis of qualitative choice-behavior. In Zarembka P,
(Ed.) Frontiers in econometrics, Academic Press, New York.
Miyagi, T. (1983) Dual approach to the modal equilibrium problem. Technical Report,
No.83-TE-MT3-8, Dept. of Civil Engineering, Gifu University.
Miyagi, T. (2005) Stochastic fictitious play, reinforcement learning and the user equilibrium in
transportation networks. A paper presented at the IVth meeting on "Mathematics in
Transport", University College London.
Miyagi, T., Peque, G. (2012) Informed user algorithm that converge to a pure Nash equilibrium
in traffic games. Procedia - Social and Behavioral Sciences, Volume 54, 4 October,
pp. 438–449.
Miyagi, T., Peque, G., Fukumoto, J. (2013) Adaptive Learning Algorithms for Traffic Games
with Naive Users. Procedia - Social and Behavioral Sciences, Volume 80, 7 June, Pages 806-
817.
Miyagi, T., Ohno, E., Morisugi, H. (1991) A fixed point algorithm for solving the traffic
equilibria. Studies of Regional Sciences, No.21, pp. 229-246.
Monderer, D., Shapley, L. (1996) Potential games. Games and Economic Behavior,
14:124–143.
Nagel, K., Flotterod, G. (2012) Agent-based traffic assignment: going from trips to behavioral
travelers. in R. Pendyala and C. Bhat (eds), Travel Behaviour Research in an Evolving World,
Emerald Group Publishing, Bingley, UK, pp. 261-293.
Nonoyama, H., Miyagi, T. (1982) A fixed point approach to the supply-demand equilibrium
problem in traffic network. Proc. of Infrastructure Planning.
Robbins, H., Monro, S. (1951) A Stochastic Approximation Method. The Annals of
Mathematical Statistics 22 (3): 400.
Rosenthal, R. (1973) A class of games possessing pure-strategy Nash equilibria. International
Journal of Game Theory 2: 65–67.
Selten, R., Schreckenberg, M., Chmura, T., Pitz, T., Kube, S., Hafstein, S., Chrobok, R.,
Pottmeier, A., Wahle, J. (2004) Experimental investigation of day-to-day route-choice
behaviour and network simulations of autobahn traffic in North Rhine-Westphalia. In:
Schreckenberg A, Selten R, editors, Human Behaviour and Traffic Networks. Springer,
Berlin Heidelberg, pp. 1-21.
Singh, S., Jaakkola, T., Littman, M., Szepesvari, C. (2000) Convergence results for single-step
on-policy reinforcement-learning algorithms. Machine Learning 38, 287-308.
Tadelis, S. (2012) Game Theory: An Introduction. Economics Books, Princeton University
Press, edition 1, volume 1, number 10001.
Van der Genugten, B. (2000) A weakened form of fictitious play in two-person zero-sum games.
Int. Game Theory Rev. 2, 307-328.
Wardrop, J. (1952) Some theoretical aspects of road traffic research. In Proceedings of the
Institution of Civil Engineers, Part II, pp. 325-378.