A New Perspective of Traffic Assignment: A Game Theoretical Approach

Journal of the Eastern Asia Society for Transportation Studies, Vol.11, 2015
Genaro PEQUE, Jr. (a), Toshihiko MIYAGI (b), Fumitaka KURAUCHI (c)
(a, b, c) Department of Civil Engineering, Gifu University, Gifu, 501-1193, Japan
(a) E-mail: gpequejr@gifu-u.ac.jp
(b) E-mail: t_miyagi@gifu-u.ac.jp
(c) E-mail: kurauchi@gifu-u.ac.jp
Abstract: Traditional equilibrium models consider transportation networks with well-defined
link travel time functions and continuous drivers. Recently, researchers have focused on adding the
behavioral dimension lacking in traditional equilibrium models by treating drivers as individual
decision-makers (atomic drivers). However, there is currently no underpinning theory that
supports the shift from macroscopic to microscopic traffic assignment modeling.
In this paper, a game theoretical model which provides this link is presented. We
show that this model describes drivers’ adaptive behaviors as they perform day-to-day route
choices. Drivers receive payoffs, perturbed by unknown noise, for their chosen and alternative routes.
This scenario describes a transportation network with the presence of a Traffic Management
Center (TMC).
Finally, a simulation-based dynamic traffic assignment is carried out to
validate the model using the Simulation of Urban MObility (SUMO) open source software. The
simulation shows that Nash equilibrium can be achieved almost surely.
Keywords: Nash Equilibrium, Multi-agent Model, Stochastic Congestion Game
1. INTRODUCTION
Traditional equilibrium models have been widely used as a modeling tool in traffic assignment.
The governing solution concept in these models is the Wardrop equilibrium. A solution to a
traffic assignment problem is a situation in which travel demand and travel supply are consistent
with each other; traffic equilibria are mathematically described in terms of a fixed point
(Nonoyama and Miyagi, 1982; Miyagi et al., 1991) where the interaction of the travel demand
and travel supply does not change the input or the outcome. This equilibrium is described by
either the user equilibrium (UE) or the stochastic user equilibrium (SUE).
A user equilibrium (UE) suggests that the flow on a route in a transportation network is
zero if the route has non-minimal cost (Wardrop, 1952). Hence, a UE is attained when all users
are on the routes with minimal costs. An analyst’s interpretation of a UE would be based on the
user’s perspective where a user can estimate the current best route in the transportation network.
This would imply that link travel time functions are common knowledge, route choices can be
observed and users calculate their best route choice based on this information (best-response).
However, assuming that users have the ability to calculate the current best route is highly
unrealistic and computationally expensive. An alternative approach is to relax these
assumptions and not require the best (optimal) route but rather to consider a user’s “perceived”
best route, caused by a user-specific random utility term, while maintaining the common
knowledge assumption (a user knows the distribution of her random utility as well). The process
requires the distribution of demand onto the routes based on the different route cost perceptions
of each user. Route flows fulfill some distribution and flows are shifted towards the desired
route-choice distribution. The shifting of flows happens gradually (iteratively) until
some stopping criterion is fulfilled indicating that a fixed point has been reached. A stochastic
user equilibrium (SUE) is then obtained when all users take the route of perceived minimal cost
(Daganzo and Sheffi, 1977).
Recently, researchers have focused on the structure of real travel decisions identifying it
as a major contributing factor in travel demand. Travel decisions are based on users’ reactions
from their interaction with each other which are not accounted for in traditional equilibrium
models. Implementation of the traditional equilibrium models such as UE and SUE focuses on
a single representative of the population, which means that the users being studied are
homogeneous and thus, behavior is invariable. Naturally, to account for real travel decisions,
different representatives from the demand population are required. This increases the level of
detail of the model which consequently increases the degree of heterogeneity of the
transportation network users. Additionally, the traditional equilibrium model treats a
user/traveler (henceforth, we will refer to a “user” as “traveler” to describe an individual
decision maker representing a single or group of users in the population with specific
characteristics) as a non-atomic particle (infinitely divisible). When the demand model accounts
for the increase of travelers, because of the combinatorial nature of all possible choices a single
traveler encounters during a single day and the non-atomic particle representation of each
traveler, traffic assignment becomes computationally intractable (Nagel and Flotterod, 2012).
To overcome this, a traveler can be interpreted as an atomic particle (a discrete decision
maker or a single agent) representing an individual in the population with a different
characteristic. The demand population can now be represented by multiple decision-makers
(multi-agent model). Flow distributions can then be reinterpreted as choice distributions over
the demand drawn using Monte Carlo techniques which maintains its mathematical
interpretation. With the advancement of computing power, micro-traffic simulators [SUMO,
VISSIM, MATSim, TRANSIMS] are being widely adopted for this purpose. A multi-agent
model typically used in micro-traffic simulation samples travelers with different characteristics
in the population and simulates the travelers’ interactions in the network. Traveler interaction
occurs during each iterative traffic assignment simulation until a stopping criterion is met.
Additionally, a traveler’s choice distribution is reinterpreted as random draws from her own
choice set (i.e. route set, a plan set, and activity chain set). Thus, an iterative solution procedure
in the traditional equilibrium models can be reinterpreted as a day-to-day learning behavioral
loop. An important aspect in traditional equilibrium models is the functional relationship
between link travel times and link flows, which is not carried over to micro-traffic simulation.
Instead, the cost-flow relationships merely serve as look-up tables (where link travel time
functions are implicitly assumed) rather than as functional relationships. Moreover, the main
advantage of using the traditional equilibrium traffic assignment models is the robustness of their
solution, the Wardrop equilibrium. Therefore, in order to overcome the limitations of the
traditional equilibrium models while preserving their solution concept, there is a need to
reinterpret (rather than change) it. We turn to game theory to model traveler behavior,
focusing on the Nash equilibrium solution concept, which in turn implies a
Wardrop equilibrium. For an extensive review on game theory’s development and application,
the readers are referred to Tadelis (2012) and Fudenberg and Levine (1998).
Miyagi and Peque (2012) proposed a game theoretical model which accounts for the
adaptive behavior of players (travelers) in a transportation network. In addition, the authors
defined three classes of players: (a) partially informed-users (PIU) with anticipated payoffs, (b)
partially informed-users with announced payoffs and (c) naïve users (NU), as a consequence
of whether players’ user-specific random utility is known or unknown and whether players’
actions can be observed or cannot be observed in addition to the user-specific travel time
functions. Although the authors believe that their stochastic congestion game model is applicable
to a dynamic traffic assignment setting, so far they have only validated it under a static
traffic assignment setting with PIU with anticipated payoffs
and naïve users. In contrast, this paper focuses on the PIU with announced payoffs and its close
relation to transportation networks with the presence of Traffic Management Centers (TMCs)
that “nowcast” travel times to all drivers in the transportation network to be used by all drivers
in making route choice decisions for the following day (day-to-day dynamics), a scenario
typical of a transportation network utilizing Intelligent Transportation Systems (ITS). To further
develop this model, we use the Simulation of Urban MObility (SUMO) software to validate it.
A clear motivation in building on this model is the need to develop comprehensive and
sophisticated traffic simulation procedures that include traffic flow simulation in which drivers’
decisions on route choices are interactively connected to the travel times generated by the traffic
simulation. Moreover, the convergence properties in dynamic route choice behavior based on
microscopic simulation are not yet fully established because travel times of the trips generated
by microscopic traffic simulation are not continuous and the expected values of the travel time
functions are not known in advance.
A similar case to the PIU with announced payoffs has been extensively studied in game
theory (Hart and Mas-Colell, 2000; Marden et al., 2009) and reinforcement learning (Borkar,
2008; Miyagi, 2005). In game theory, such cases mostly fall under the better-reply variety of no-regret
algorithms. Hart and Mas-Colell’s (2000) work focused on convergence to the set of
correlated equilibria using regret-matching, while Marden et al.’s (2009) work strengthened
the guarantees of regret-based learning in weakly acyclic games. They proved convergence to
Nash equilibrium almost surely. However, players’ payoffs in these cases are unperturbed (no
additive random utility). On the other hand, reinforcement learning using stochastic
approximation (Robbins and Monro, 1951) was extensively studied by Borkar (2008) and was
applied by Miyagi (2005) to transportation, however, under the continuous player assumption.
Reinforcement learning is normally used when players’ payoffs are initially unknown and must
be estimated over time due to noisy observations (i.e. corrupted payoffs due to the unobserved
switches in actions by the other players, delay/inaccuracy of the information received, etc.).
This approach has been used by the authors we follow (Leslie and Collins, 2003; Leslie and Collins, 2005; Leslie
and Collins, 2006; Cominetti et al., 2010; Chapman et al., 2013), but they considered
naïve users.
Our contribution in this paper is the application of the stochastic congestion game model
with PIU with announced payoffs proposed by the authors (Miyagi and Peque, 2012) to a
simulation-based dynamic traffic assignment. In the simulation, we used a
generalised weakened fictitious play actor-critic algorithm (Leslie and Collins, 2006), proposed
for the naïve user case, in the PIU with announced payoffs case. However, we slightly modified
the temperature (dispersion or logit) parameter updating scheme by using a regret-based
updating scheme (Miyagi and Peque, 2012; Miyagi et al., 2013) wherein players’ route choices
improve, based on their regret, as time progresses, which readily justifies the algorithm as
a model of learning. More importantly, our simulation results show that convergence to Nash
equilibrium is achieved almost surely.
The paper progresses as follows: In the next section, we introduce the notations,
definitions and concepts used in game theory and how they are applied to the traffic assignment
problem. In section 3, we introduce the stochastic congestion game model together with the
derivation of some of the updating formulations we use in this paper. In section 4, we introduce
the generalised weakened fictitious play actor-critic learning model and its development.
In section 5, we present the simulation-based dynamic traffic assignment
using the Simulation of Urban MObility (SUMO) software and show that players’
payoffs converge to Nash equilibrium almost surely. In section 6, we present our conclusions.
2. CONGESTION GAMES
In this section, we introduce a game, including its notations and some definitions, describing
the transportation network and its players. We then define the desired outcome of the
corresponding game.
2.1 Notations
Consider a game $\mathcal{G}$ described by the triple,
$$\mathcal{G} = \left(\mathcal{I}, \{\mathcal{A}^i, u^i\}_{i \in \mathcal{I}}\right). \tag{2.1.1}$$
The sets $\mathcal{I} = \{1, \dots, i, \dots, I\}$, where $I = |\mathcal{I}|$, and
$\mathcal{A}^i = \{a_1^i, \dots, a_k^i, \dots, a_N^i\}$, where $N = |\mathcal{A}^i|$,
represent the set of players and the set of actions of each player $i$, respectively. We use the
notation $a^{-i} \in \mathcal{A}^{-i}$ to represent the action taken by the opponent(s) of player $i$,
$a^{-i} = (a^1, \dots, a^{i-1}, a^{i+1}, \dots, a^I)$, and the action set of her opponent(s),
$\mathcal{A}^{-i} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^{i-1} \times \mathcal{A}^{i+1} \times \cdots \times \mathcal{A}^I$.
An action profile is a vector denoted by
$a = (a^1, \dots, a^i, \dots, a^I) \in \mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^i \times \cdots \times \mathcal{A}^I$.
We use the conventional notation $a = (a^i, a^{-i})$ to represent an action
profile to explicitly show an action taken by player $i$ against the actions taken by her
opponent(s), $-i$. In this analysis, these sets are assumed finite, non-empty, non-unitary and
time-invariant. In the game $\mathcal{G}$, each player $i$ represents a driver in the transportation network
choosing among her set of routes, represented by $\mathcal{A}^i$, from her origin to her destination. We
sometimes interchangeably use the terms driver, user, traveler and player. The game $\mathcal{G}$ is
played stage by stage as a repeated game. In a repeated game, each stage
$t \in T = \{0, 1, 2, \dots\} \subseteq \mathbb{N}$ lasts until all the players have chosen an action $a_t^i$;
the resulting profile is denoted by $a_t = (a_t^1, \dots, a_t^i, \dots, a_t^I)$.
The payoff of each player $i$ in a one-shot game, $T = \{0\}$, is determined by the function
$u^i: \mathcal{A} \to \mathbb{R}$. When the one-shot game is repeated finitely or infinitely often, $T = \{0, 1, 2, \dots\}$,
each player $i \in \mathcal{I}$ observes a sample $\mathcal{U}_t^i$ which is the player’s payoff at stage $t$ expressed as
$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i})$. Each player’s action $a_t^i$ at stage $t$ is chosen according to a probability
distribution, $\pi_t^i$, which we will refer to as the strategy of player $i$ at stage $t$. A player’s
strategy at stage $t$ relies only on her observations from stages $\{0, 1, 2, \dots, t-1\}$, which
are dependent on the information restrictions assumed.
We define the empirical frequency of an action selected by player $i$ at stage $t$ as,
$$z_t^i(a^i) = \frac{1}{t} \sum_{s=0}^{t-1} \mathbb{I}\{a_s^i = a^i\}, \tag{2.1.2}$$
where $\mathbb{I}\{\cdot\}$ is the indicator function that takes the value of 1 if the statement in the braces
is true and 0 otherwise.
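As a small illustration of equation (2.1.2), the sketch below computes the empirical frequency of each route from one player's history of choices. The function name `empirical_frequency`, the route labels and the five-day history are hypothetical and are used only for this example.

```python
from collections import Counter

def empirical_frequency(history, actions):
    """Return z_t(a) = (1/t) * sum_{s<t} 1{a_s == a} for every action a."""
    t = len(history)
    if t == 0:
        raise ValueError("at least one stage is required")
    counts = Counter(history)
    return {a: counts[a] / t for a in actions}

# Hypothetical route choices of one driver over t = 5 days.
history = ["A", "B", "A", "A", "C"]
z = empirical_frequency(history, actions=["A", "B", "C"])
print(z)  # {'A': 0.6, 'B': 0.2, 'C': 0.2}
```

The frequencies sum to one, so they can be read as an (empirical) mixed strategy over the player's routes.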
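From the stage payoffs, each player can estimate their action values denoted by,
$$\bar{\mathcal{V}}_t^i(\tilde{a}^i) = \frac{1}{t} \sum_{s=0}^{t-1} \mathbb{I}\{a_s^{-i} = a^{-i}\}\, u^i(\tilde{a}^i, a_s^{-i}) = u^i(\tilde{a}^i, z_s^{-i}), \quad \forall a^i \in \mathcal{A}^i. \tag{2.1.3}$$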
An average of the realized payoffs for player $i$ at stage $t$ can then be defined as,
$$\bar{\mathcal{U}}_t^i = \sum_{s=0}^{t-1} z_s^i\, u^i(a^i, z_s^{-i}) := u^i(z_t^i, z_t^{-i}), \tag{2.1.4}$$
where $z_t^{-i} = (z_t^1, \dots, z_t^{i-1}, z_t^{i+1}, \dots, z_t^I)$. For now, let the empirical frequencies,
$z_t^i(a^i), \forall a^i \in \mathcal{A}^i$, of player $i$ denote the (empirical) mixed-strategy,
$z_t^i(a^i) = \pi_t^i(a^i) \in \Delta(\mathcal{A}^i), \forall a^i \in \mathcal{A}^i$,
of player $i$ at stage $t$. Consider a discrete-time process where the objective of each player is
to maximize her expected payoff based on her mixed-strategy denoted by,
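$$\max_{\pi} u^i(\pi_t^i, \pi_t^{-i}) = \lim_{t \to \infty} \mathbb{E}_{\pi}\left[\frac{1}{t} \sum_{s=0}^{t-1} \mathcal{U}_s^i\right] = \lim_{t \to \infty} \mathbb{E}_{\pi}\left[\bar{\mathcal{U}}_t^i\right]. \tag{2.1.5}$$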
A player’s strategy $\sigma^i \in \Sigma^i$ is the function $\sigma^i: \bar{\mathcal{V}}_t^i \to \Delta(\mathcal{A}^i)$ which induces the set of
probability distributions or mixed-strategies at each stage, $\{\pi_t^i\}_{t>0}$, and $\Sigma^i$ is the set of all
possible strategies of player $i$. Let $\Sigma = (\Sigma^1, \dots, \Sigma^i, \dots, \Sigma^I)$ be the set of all strategy profiles.
Whenever the mixed-strategies at stage $t$, $\pi_t$, induce the same probability distributions,
$\pi_t^i \in \Delta(\mathcal{A}^i), \forall a^i \in \mathcal{A}^i, i \in \mathcal{I}$, in the succeeding stages such that they maximize the players’ payoffs
and none of the players can obtain a performance improvement by unilaterally using
another mixed-strategy, the profile is called a mixed-strategy Nash equilibrium. A mixed-strategy Nash
equilibrium is formally defined as follows.
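Definition 2.1.1. (Mixed-strategy Nash equilibrium). In the game $\mathcal{G}$, a strategy profile
$\pi_* = (\pi_*^1, \dots, \pi_*^I)$ with $\pi_*^i \in \Delta(\mathcal{A}^i)$ is a mixed-strategy Nash equilibrium if, for all $i \in \mathcal{I}$ and for all
$\pi^i \in \Delta(\mathcal{A}^i)$,
$$u^i(\pi_*^i, \pi_*^{-i}) \geq u^i(\pi^i, \pi_*^{-i}). \tag{2.1.6}$$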
When all players assign a probability 1 to only one action, i.e. $\pi^i(a^i) = 1$, and the
condition above is satisfied, we get a Nash equilibrium in pure strategies which we formally define below.
Definition 2.1.2. (Pure-strategy Nash equilibrium). In the game $\mathcal{G}$, an action profile
$a_* \in \mathcal{A}$ is a pure-strategy Nash equilibrium if, for all $i \in \mathcal{I}$ and for all $a^i \in \mathcal{A}^i$,
$$u^i(a_*^i, a_*^{-i}) \geq u^i(a^i, a_*^{-i}). \tag{2.1.7}$$
Nash equilibrium is one of the central solution concepts of game theory. Therefore, one
of the objectives of learning models is to study the kind of behavioral rules that lead to this
equilibrium as a consequence of the long-run, non-equilibrium process of learning.
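The pure-strategy condition (2.1.7) can be checked by brute force in a small game. The following is a minimal sketch under assumed data: three drivers, two parallel routes, and a travel time on each route equal to the number of drivers using it. The helper names `is_pure_nash` and `payoff` are hypothetical.

```python
from itertools import product

def is_pure_nash(profile, action_sets, payoff):
    """Check (2.1.7): no player gains from a unilateral deviation."""
    for i, actions in enumerate(action_sets):
        current = payoff(i, profile)
        for alt in actions:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if payoff(i, deviated) > current:
                return False
    return True

# Assumed example: 3 drivers, routes 0 and 1, and the travel time on a
# route equals the number of drivers on it (payoff = -travel time).
action_sets = [(0, 1)] * 3

def payoff(i, profile):
    return -sum(1 for a in profile if a == profile[i])

equilibria = [p for p in product(*action_sets)
              if is_pure_nash(p, action_sets, payoff)]
print(equilibria)
```

In this assumed example every profile that splits the drivers two-to-one is an equilibrium, while the two profiles in which all drivers share one route are not.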
2.2 Potential Games and Weakly Acyclic Games
We define the transportation network as a traffic game with atomic flow. A traffic game with
atomic flow was first proposed by Rosenthal (1973) and is known to be equivalent to a
(deterministic) congestion game. A congestion game is a special case of a potential game
(Monderer and Shapley, 1996) where the incentive of all players to change their strategy can
be expressed using a single global function called the potential function, 𝜙. For now, we define
a potential game and its generalizations. A potential game is formally defined as follows.
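Definition 2.2.1. (Potential games). A finite $I$-player game with action sets $\{\mathcal{A}^i\}_{i \in \mathcal{I}}$
and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a potential game if for all $i \in \mathcal{I}$, for all $a^{-i} \in \mathcal{A}^{-i}$, for all pairs
$(a^i, \tilde{a}^i) \in \mathcal{A}^i \times \mathcal{A}^i$ and for some potential function $\phi: \mathcal{A} \to \mathbb{R}$,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) = \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}). \tag{2.2.1}$$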
This means that each player’s payoff function is aligned with the potential function.
Additionally, potential games have the finite improvement property (FIP): any best or
better-response of a player to some action profile increases the potential function, and every
best or better-response path leads to a Nash equilibrium. Figure 2.2.1 shows a
game of three players with two actions each, $\mathcal{A}^i = \{0, 1\}$, and two Nash equilibria (blue nodes).
A node represents the action profile chosen by the players, while each directed link represents an
improvement of a player’s payoff (an improvement path). The left panel shows an example of a potential game.
We define a general type of potential game where the players’ payoff function alignment
with the potential function is relaxed. It is defined as follows.
Figure 2.2.1. A potential game (left) and a weakly acyclic game (right)
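Definition 2.2.2. (Generalized ordinal potential games). A finite $I$-player game with
action sets $\{\mathcal{A}^i\}_{i \in \mathcal{I}}$ and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a generalized ordinal potential game if for
all $i \in \mathcal{I}$, for all $a^{-i} \in \mathcal{A}^{-i}$, for all pairs $(a^i, \tilde{a}^i) \in \mathcal{A}^i \times \mathcal{A}^i$ and for some potential function
$\phi: \mathcal{A} \to \mathbb{R}$,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) > 0 \implies \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}) > 0. \tag{2.2.2}$$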
A generalized ordinal potential game also has the FIP.
A less restrictive class of games, more general than both potential and
generalized ordinal potential games, is the class of weakly acyclic games, which we use in this paper.
A weakly acyclic game requires only that at least one player’s payoff function is aligned with
the potential function. Before defining weakly acyclic games, we first define better and
best-response actions and strategies. These are formally defined as follows.
Definition 2.2.3. (Better-response). An action $a^i \in \mathcal{A}^i$ is a better-response of player $i$
to an action profile $(\tilde{a}^i, a^{-i})$ if $u^i(a^i, a^{-i}) > u^i(\tilde{a}^i, a^{-i})$. A mixed-strategy $\pi^i \in \Delta(\mathcal{A}^i)$ is a
better-response of player $i$ to a strategy profile $(\tilde{\pi}^i, \pi^{-i})$ if $u^i(\pi^i, \pi^{-i}) > u^i(\tilde{\pi}^i, \pi^{-i})$.
Definition 2.2.4. (Best-response). An action $a^i \in \mathcal{A}^i$ is a best-response of player $i$ to
an action profile $a^{-i} \in \mathcal{A}^{-i}$ of the other players if $a^i \in \arg\max_{\tilde{a}^i} u^i(\tilde{a}^i, a^{-i})$. A mixed-strategy
$\pi^i \in \Delta(\mathcal{A}^i)$ is a best-response of player $i$ to a mixed-strategy profile $\pi^{-i} \in \Delta(\mathcal{A}^{-i})$
of the other players if $\pi^i \in \arg\max_{\tilde{\pi}^i} u^i(\tilde{\pi}^i, \pi^{-i})$.
A best-response strategy is normally used when unperturbed payoffs with complete
information are assumed, where greedy algorithms can easily be applied. On the other hand,
perturbed payoffs with incomplete information require a better-response strategy, as it relies
on a player’s beliefs (which may not be accurate) about her environment, which improve over
time, getting closer to or becoming equal to a best-response, as she gains experience.
We now formally define weakly acyclic games as follows.
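Definition 2.2.5. (Weakly acyclic games). A finite $I$-player game with action sets
$\{\mathcal{A}^i\}_{i \in \mathcal{I}}$ and payoff functions $\{u^i\}_{i \in \mathcal{I}}$ is a weakly acyclic game if there exists a potential
function, $\phi: \mathcal{A} \to \mathbb{R}$, with the following property: For any action profile $a$ that is not a Nash
equilibrium, $\exists i \in \mathcal{I}$ with an action $a^i \in \mathcal{A}^i$ for all $a^{-i} \in \mathcal{A}^{-i}$, for all pairs
$(a^i, \tilde{a}^i) \in \mathcal{A}^i \times \mathcal{A}^i$ such that,
$$u^i(a^i, a^{-i}) - u^i(\tilde{a}^i, a^{-i}) > 0 \ \text{ and } \ \phi(a^i, a^{-i}) - \phi(\tilde{a}^i, a^{-i}) > 0. \tag{2.2.3}$$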
The right panel of Figure 2.2.1 shows a weakly acyclic game. The red directed links
represent a loop where at least one player’s payoff function is aligned with the potential
function. Weakly acyclic games are generalizations of the Cournot adjustment process of two
firms (i.e. players). The Cournot adjustment assumes that in each period one firm chooses a
pure strategy that is a best-response to the strategy of the other firm from the previous stage.
Weakly acyclic games do not necessarily have the finite improvement property, as shown
above. They were originally defined for better-responses but have recently also been defined for
best-responses (Fabrikant et al., 2013).
2.3 Flows and Costs
We begin with the flow conservation equations in traffic games with atomic flow. For simplicity,
we restrict our analysis to a transportation network with a single origin-destination (OD) pair
connected by a set of paths, $\mathcal{K} = \{1, \dots, k, \dots, N\}$, each made up of a subset of links, $\ell \in \mathcal{L}$. We
assume that the set of available paths is the same for all players in the transportation network
and is defined to be the players’ action sets, i.e.
$\{a_1^i(1), \dots, a_k^i(k), \dots, a_N^i(N)\} := \{1, \dots, k, \dots, N\}, \forall i \in \mathcal{I}$.
To avoid confusion, we drop the path index $k$ in the notation $a_k^i(k)$,
which means that we use $a^i$ and $k$ interchangeably to denote an action or a path selected by
player $i$. Path flows are denoted by an $N$-dimensional vector $h = (h(1), \dots, h(k), \dots, h(N))$
where each element represents the number of players who chose the path $k$, $h(k) = |\{i : a^i = k\}|$.
Hence, $\sum_{k \in \mathcal{K}} h(k) = |\mathcal{I}|$.
A visit to path $k$ by player $i$ at stage $t$ is expressed as,
$$z_t^i(k) = \mathbb{I}\{a_t^i = k\}. \tag{2.3.1}$$
The aggregated path flows at an arbitrary stage $t$ are then defined as follows.
$$\sum_{a^i \in \mathcal{A}^i} z_t^i(k) = 1, \tag{2.3.2}$$
$$\sum_{i \in \mathcal{I}} z_t^i(k) = h_t(k). \tag{2.3.3}$$
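Let $\{\delta_\ell(k)\}_{\ell \in k}$ denote elements in the link-path incidence matrix and $f_\ell$ be the flow on
the link $\ell$. We can then define the link flows as,
$$\sum_{k \in \mathcal{K}} \delta_\ell(k)\, h_t(k) = f_{\ell,t}, \quad \forall \ell \in \mathcal{L}. \tag{2.3.4}$$
We also use the following notation on link flows,
$$f_\ell^i = \sum_{k \in \mathcal{A}^i} \mathbb{I}\{\ell = k\}, \quad \forall i \in \mathcal{I}, \tag{2.3.5}$$
$$\sum_{i \in \mathcal{I}} f_\ell^i = f_\ell. \tag{2.3.6}$$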
Congestion games are a specific class of games in which players’ payoff functions have
a special structure. Let $\mathcal{L} = \{\ell_1, \ell_2, \dots\}$ denote a finite set of links. For each link $\ell \in \mathcal{L}$, there is
an associated congestion or travel time function denoted by,
$$\tau_\ell: \{0, 1, 2, \dots\} \to \mathbb{R}, \tag{2.3.7}$$
which reflects the travel time for “using” the link as a function of the number of players using
that link, $\ell$.
The link travel time is given by a real-valued, non-decreasing but not necessarily
continuously differentiable function, $\tau_\ell(f_\ell)$. The cost of a path $k \in \mathcal{A}^i$ chosen by player $i$ at
stage $t$ is defined as,
$$c^i(k) = \sum_{\ell \in \mathcal{L}} \delta_\ell(k)\left(\gamma^i \tau_\ell + F_\ell\right), \tag{2.3.8}$$
where $\gamma^i$ is the value of time for player $i$ and $F_\ell$ is the fare imposed on the link $\ell$. We define
the payoff that a player receives when she chooses a path $k \in \mathcal{A}^i$ as $u^i(k) = -c^i(k)$. Since a
path flow is dependent on the link flows which are also dependent on the number of discrete
players, the payoff function is discontinuous.
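The chain from route choices to path flows, link flows (2.3.4) and path costs (2.3.8) can be traced numerically. The sketch below uses a hypothetical three-link, two-path network; the incidence matrix `DELTA`, the free-flow times `FREE_FLOW` and the linear travel time function are assumptions made only for illustration.

```python
# Hypothetical network with 3 links and 2 paths:
# path 0 uses links 0 and 2, path 1 uses links 1 and 2.
DELTA = [
    [1, 0],  # link 0
    [0, 1],  # link 1
    [1, 1],  # link 2
]
FREE_FLOW = [2.0, 3.0, 1.0]  # assumed free-flow time of each link

def path_flows(choices, n_paths):
    """h(k): number of players whose chosen path is k."""
    return [sum(1 for a in choices if a == k) for k in range(n_paths)]

def link_flows(delta, h):
    """f_l = sum_k delta_l(k) * h(k), as in (2.3.4)."""
    return [sum(d * h_k for d, h_k in zip(row, h)) for row in delta]

def travel_time(link, flow):
    """Assumed non-decreasing travel time: free-flow time plus one unit per user."""
    return FREE_FLOW[link] + flow

def path_cost(k, delta, f, gamma=1.0, fares=None):
    """c(k) = sum_l delta_l(k) * (gamma * tau_l(f_l) + F_l), as in (2.3.8)."""
    fares = fares or [0.0] * len(delta)
    return sum(delta[l][k] * (gamma * travel_time(l, f[l]) + fares[l])
               for l in range(len(delta)))

choices = [0, 0, 1, 0]                 # route chosen by each of 4 players
h = path_flows(choices, n_paths=2)     # path flows
f = link_flows(DELTA, h)               # link flows
payoffs = [-path_cost(k, DELTA, f) for k in range(2)]
print(h, f, payoffs)
```

Because flows are integer counts, moving a single player changes a link flow by one whole unit, which is the source of the discontinuity in the payoff function noted above.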
3. STOCHASTIC CONGESTION GAMES
3.1 Travel Information and Route Choice
Following Selten et al. (2004), Miyagi and Peque (2012) introduced different classes of players
defined by the player’s knowledge about the states of the routes on a traffic network: Partially
informed-users (PIUs) and Naïve users (NUs).
PIUs are further categorized into partially informed-users with announced payoffs and
partially informed-users with anticipated payoffs. The first type of players are assumed to not
know the structural form of their payoff functions nor any information about the other players.
However, a Traffic Management Center (TMC) announces to all players, in hindsight,
the observed realized payoffs of the actions taken by all the players in the transportation
network. Additionally, payoffs of alternative actions not taken by the players are also
announced to all players. Therefore, each player can get the realized travel times in all of the
available routes between any O-D pair. On the other hand, for the second type of players, each
player knows the structural form of her own payoff function and is capable of observing the
actions of all the other players at every stage. However, she doesn't know the structural form of
the other players’ payoff functions. Each player can estimate the expected payoffs that she
would receive by taking other actions different from the action taken at stage 𝑡 through
exploration where the actions of the other players are held constant. Furthermore, each player
believes that the other players’ action selections are based on empirical frequencies. Naïve users
are more realistic in the sense that the only information available to each of them is the realized travel
time of the selected route on that day.
We restrict our attention to the equilibrium problem of traffic networks used by PIUs with
announced payoffs in this paper. The assumptions on the PIUs with the announced payoffs
follow the prevailing assumptions used in the traditional route choice models; however, we
assume that the travel time functions (or cost functions) are not common knowledge (it will be
similar if we assume that players’ true expected payoffs are the same). Furthermore, the PIU
with the announced payoffs can be regarded as a situation where a TMC observes traffic
volumes and vehicle average speeds on each link in the network through sensors allocated in
the system, and computes all possible paths during a specified time period for any origin-
destination pair in the network.
3.2 The Model
For now we set $\gamma^i = 1, \forall i \in \mathcal{I}$. We define the potential function in a (deterministic) congestion
game which we are trying to minimize as,
$$\min \phi(h) = \sum_{\ell \in a} \sum_{m=0}^{f_\ell(h)} \tau_\ell(m). \tag{3.2.1}$$
The traffic game was shown by Rosenthal (1973) to have at least one pure-strategy Nash
equilibrium.
Lemma 3.2.1. (Rosenthal, 1973). A game with a strictly increasing cost function with
respect to $f_\ell$ of the form (2.3.8) with a potential function of the form (3.2.1) possesses at least
one pure-strategy Nash equilibrium.
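The lemma can be illustrated numerically on a toy network. The sketch below is not the authors' implementation: the network `PATHS`, the travel time function `tau` and the choice to sum the potential over all links are assumptions. It checks that a unilateral route change alters a player's cost by exactly the change in the potential, and that a minimizer of the potential is a pure-strategy Nash equilibrium.

```python
from itertools import product

# Hypothetical network: each path is the tuple of link indices it uses.
PATHS = {0: (0, 2), 1: (1, 2), 2: (3,)}
N_LINKS = 4
N_PLAYERS = 3

def tau(link, m):
    """Assumed travel time of a link used by m players (strictly increasing in m)."""
    return (link + 1) * m

def link_flows(profile):
    flows = [0] * N_LINKS
    for path in profile:
        for link in PATHS[path]:
            flows[link] += 1
    return flows

def cost(i, profile):
    """Cost of player i, i.e. (2.3.8) with gamma = 1 and no fares."""
    flows = link_flows(profile)
    return sum(tau(link, flows[link]) for link in PATHS[profile[i]])

def potential(profile):
    """Potential of the form (3.2.1), summed here over all links (an assumption)."""
    flows = link_flows(profile)
    return sum(tau(link, m)
               for link in range(N_LINKS)
               for m in range(0, flows[link] + 1))

def deviate(profile, i, alt):
    return profile[:i] + (alt,) + profile[i + 1:]

def is_exact_potential():
    """Cost change of the deviating player equals the change in the potential."""
    for profile in product(PATHS, repeat=N_PLAYERS):
        for i in range(N_PLAYERS):
            for alt in PATHS:
                other = deviate(profile, i, alt)
                if cost(i, profile) - cost(i, other) != potential(profile) - potential(other):
                    return False
    return True

def is_nash(profile):
    return all(cost(i, profile) <= cost(i, deviate(profile, i, alt))
               for i in range(N_PLAYERS) for alt in PATHS)

best = min(product(PATHS, repeat=N_PLAYERS), key=potential)
print(is_exact_potential(), best, is_nash(best))
```

Integer travel times are used so that the equality test is exact; with floating-point travel times a small tolerance would be needed.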
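A (deterministic) congestion game is an exact potential game. The action $a^i$ is a
best-response of player $i$ when,
$$\sum_{\ell \in \tilde{a}^i} \tau_\ell\left(f_\ell^{-i} + 1\right) > \sum_{\ell \in a^i} \tau_\ell\left(f_\ell^{-i} + 1\right), \quad a^i \neq \tilde{a}^i, \ \forall \tilde{a}^i \in \mathcal{A}^i \tag{3.2.2}$$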
holds. Inequality (3.2.2) expresses the exploration process in which each player can compare
the payoffs (or costs) among alternative routes and judge which alternative is the best. This
exploration process might be automatically accomplished by a micro-processor installed in a
traveler’s own car. Exploration implies that player $i$ knows the structural form of her payoff
function and hence, the cost function and can search the action values of her alternative actions
provided that the actions of her opponents remain unchanged.
In (deterministic) congestion games, it is assumed that all players have complete and
perfect knowledge of the game. Therefore, the realized payoff is always equal to the expected
payoff which is the result of the best action (route) chosen by player 𝑖 at each stage which is
denoted as,
$$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i}). \tag{3.2.3}$$
Equation (3.2.3) explicitly shows that player $i$ can get information about the other players’
actions, $a^{-i}$. To prevent drivers from making deterministic decisions, a stochastic congestion
game is used. In stochastic congestion games the payoff function is perturbed. It is assumed
that the realized payoff consists of the expected payoff $u^i(a^i, a^{-i})$ and a player-specific
random term $e_t^i(a_t^i)$. That is,
$$\mathcal{U}_t^i = u^i(a_t^i, a_t^{-i}) + e_t^i(a_t^i), \tag{3.2.4}$$
where $u^i$ is the true expected payoff and $e_t^i(a_t^i)$ is a component of the player-specific and
time-dependent noise or private information, $e_t^i = (e_t^i(1), \dots, e_t^i(k), \dots, e_t^i(m_t))$. It should be
noted that the realized payoff is defined when all the actions of players who are participants of
the game are observed. Each player believes that the action selections of the other players are
executed based on their mixed-strategies. Therefore, each player’s strategy is formulated as
follows:
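$$\tilde{\beta}^i(u^i) = \underset{\pi^i \in \mathrm{int}(\Delta(\mathcal{A}^i))}{\arg\max} \left[\sum_{a^i \in \mathcal{A}^i} \pi^i(a^i)\, u^i(a^i, \pi^{-i}) + \mu^i \psi^i(\pi^i)\right], \tag{3.2.5}$$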
where $\mu^i > 0$ is a smoothing parameter and the function $\psi^i(\pi^i)$ is known only to player $i$
and is assumed to be a smooth, strictly differentiable concave function satisfying the boundary
condition that as $\pi^i$ approaches the boundary of the simplex, the slope of $\psi^i$ becomes infinite.
Fudenberg and Levine (1998) assumed the following entropy function:
$$\psi^i(\pi^i) = -\sum_{a^i \in \mathcal{A}^i} \pi^i(a^i) \log \pi^i(a^i). \tag{3.2.6}$$
This formulation generates the so-called smooth best response function:
$$\tilde{\beta}^i(u^i) = \frac{\exp\{u^i(a^i, \pi^{-i}) / \mu^i\}}{\sum_{b^i \in \mathcal{A}^i} \exp\{u^i(b^i, \pi^{-i}) / \mu^i\}} \in \Delta(\mathcal{A}^i). \tag{3.2.7}$$
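A short worked derivation indicates how the entropy function (3.2.6) turns the maximization (3.2.5) into a logit form. It assumes that the maximizer is interior and that $u^i(a^i, \pi^{-i})$ is held fixed; the Lagrangian $\Lambda$ and the multiplier $\nu$ are auxiliary symbols introduced here for the constraint that the probabilities sum to one.

```latex
\begin{align*}
\Lambda(\pi^i, \nu) &= \sum_{a^i \in \mathcal{A}^i} \pi^i(a^i)\, u^i(a^i, \pi^{-i})
  - \mu^i \sum_{a^i \in \mathcal{A}^i} \pi^i(a^i) \log \pi^i(a^i)
  - \nu \Big( \sum_{a^i \in \mathcal{A}^i} \pi^i(a^i) - 1 \Big), \\
\frac{\partial \Lambda}{\partial \pi^i(a^i)} &= u^i(a^i, \pi^{-i}) - \mu^i \big( \log \pi^i(a^i) + 1 \big) - \nu = 0
  \;\Longrightarrow\;
  \pi^i(a^i) = \exp\{ u^i(a^i, \pi^{-i}) / \mu^i \} \, \exp\{ -(\nu + \mu^i)/\mu^i \}, \\
\sum_{b^i \in \mathcal{A}^i} \pi^i(b^i) = 1
  &\;\Longrightarrow\;
  \pi^i(a^i) = \frac{\exp\{ u^i(a^i, \pi^{-i}) / \mu^i \}}{\sum_{b^i \in \mathcal{A}^i} \exp\{ u^i(b^i, \pi^{-i}) / \mu^i \}}.
\end{align*}
```

The second factor in the middle line does not depend on $a^i$, so it cancels in the normalization, leaving choice probabilities proportional to $\exp\{u^i / \mu^i\}$.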
Equation (3.2.7) is a map from payoffs to choice probabilities which is the standard choice
probability function from the additive random utility model of discrete choice theory
(McFadden, 1974) where the random utility $e^i$ is distributed according to the double
exponential distribution. Miyagi (1983) showed a duality relation between the entropy function
and the satisfaction function (or log-sum function) of the logit model using Fenchel’s duality
theorem while Hofbauer and Sandholm (2002) used an analysis based on the Legendre
transform. These imply that the log-sum function is the optimized function of the random
utility model and gives a potential function for a stochastic congestion game. However, the
duality holds if and only if the random utility $e_t^i(a_t^i)$ is specified by the double exponential
distribution.
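The role of the smoothing parameter can be seen in a small numerical sketch of the logit map. It assumes that the parameter enters as payoff divided by $\mu$, consistent with the entropy-perturbed maximization (3.2.5); the function name `smooth_best_response` and the three payoffs are hypothetical.

```python
import math

def smooth_best_response(payoffs, mu):
    """Logit choice probabilities proportional to exp(u / mu), with mu > 0."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    top = max(payoffs)  # subtract the maximum for numerical stability
    weights = [math.exp((u - top) / mu) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed payoffs (negative travel times) of three routes.
payoffs = [-10.0, -9.0, -12.0]
for mu in (100.0, 1.0, 0.1):
    probs = smooth_best_response(payoffs, mu)
    print(mu, [round(p, 4) for p in probs])
```

A large $\mu$ gives nearly uniform choice probabilities, while a small $\mu$ concentrates almost all probability on the route with the highest payoff, approaching a best-response.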
In our PIU with announced payoffs specification, other players’ actions cannot be
observed and the distribution of the random utility is unknown. Additionally, all stage payoffs
are announced (i.e. chosen and alternative actions’ payoffs) which is denoted by,
$$\mathcal{U}_t^i = u^i(a_t^i) + e_t^i(a_t^i), \quad \forall a^i \in \mathcal{A}^i, \ \forall t. \tag{3.2.8}$$
Hence, payoffs must be estimated. A player tries to maximize her utility by choosing an action
using the Boltzmann-Gibbs action selection procedure with a vanishing temperature parameter,
$\mu_t^i$, for her strategy selection given by the equation,
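$$\tilde{\beta}^i(\mathcal{Q}_t^i) = \frac{\exp\{\mathcal{Q}_t^i(a^i) / \mu_t^i\}}{\sum_{b^i \in \mathcal{A}^i} \exp\{\mathcal{Q}_t^i(b^i) / \mu_t^i\}}, \quad a^i \in \mathcal{A}^i, \ i \in \mathcal{I}, \tag{3.2.9}$$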
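and $\mathcal{Q}$-learning given by equation (3.2.10) for her payoff estimation,
$$\mathcal{Q}_t^i(a^i) = \mathcal{Q}_{t-1}^i(a^i) + \lambda_t \left(\mathcal{U}_t^i(a^i) - \mathcal{Q}_{t-1}^i(a^i)\right), \quad \forall a^i \in \mathcal{A}^i. \tag{3.2.10}$$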
To show the equivalence of equations (3.2.7) and (3.2.9), the condition
$$\left\| \mathcal{Q}_t^i(a^i) - \mathcal{U}_t^i(a^i, \pi^{-i}) \right\| \to 0 \quad \text{as } t \to \infty \text{ a.s.} \tag{3.2.11}$$
must hold.
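The role of condition (3.2.11) can be illustrated with a small numerical experiment. The sketch below applies the update (3.2.10) to every action at each stage, as in the announced payoffs case. It assumes Gaussian noise and the step size $\lambda_t = 1/t$, neither of which is specified by the model (the noise distribution is unknown to the players), and the mean payoffs are hypothetical.

```python
import random

def q_update(q, announced, step):
    """Apply (3.2.10) to every action; all payoffs are announced."""
    return [q_a + step * (u_a - q_a) for q_a, u_a in zip(q, announced)]

def run_q_learning(mean_payoffs, iterations=5000, noise_sd=10.0, seed=0):
    """Return the payoff estimates after repeated noisy announcements."""
    rng = random.Random(seed)
    q = [0.0] * len(mean_payoffs)
    for t in range(1, iterations + 1):
        # Announced payoff = mean payoff + noise (Gaussian noise is an assumption).
        announced = [u + rng.gauss(0.0, noise_sd) for u in mean_payoffs]
        step = 1.0 / t  # assumed step size; Q becomes the running sample mean
        q = q_update(q, announced, step)
    return q

# Hypothetical mean payoffs (negative travel times in seconds) of three routes.
mean_payoffs = [-240.0, -255.0, -250.0]
estimates = run_q_learning(mean_payoffs)
gaps = [abs(q - u) for q, u in zip(estimates, mean_payoffs)]
print(estimates)
print(gaps)
```

With this step size the estimate is the sample mean of the announced payoffs, so the gap to the mean payoff shrinks as the number of iterations grows even though each single announcement is noisy.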
4. THE PARTIALLY INFORMED-USER ALGORITHM
In this section, we first introduce the generalised weakened fictitious play process and then
proceed with the actor-critic learning algorithm for the day-to-day route choice of players in
the stochastic congestion game with partially informed users with announced payoffs.
4.1 The Generalised Weakened Fictitious Play
The generalised weakened fictitious play (GWFP) actor-critic algorithm was proposed by
Leslie and Collins (2006) for the naïve user case, where they proved that, with probability 1,
the mixed strategies follow a GWFP process. However, the generalised weakened fictitious play
was proposed as an extension of the weakened fictitious play of Van der Genugten (2000), which
is normally considered for games wherein players' actions can be observed and is therefore
closely related to the PIU with anticipated payoffs. Additionally, the GWFP process considers
a vanishing best-response perturbation as a mechanism for speeding up the convergence of
fictitious play, which implies that strategies are also estimated according to a stochastic
approximation process.
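Before we formally define a GWFP process, let $b^i(\pi^{-i})$ be the best-response set of
player $i$ to the mixed strategy $\pi^{-i}$ and let
$$b_\varepsilon^i(\pi^{-i}) = \left\{\pi^i \in \Delta(\mathcal{A}^i) : \mathcal{U}_t^i(\pi^i, \pi^{-i}) \ge \mathcal{U}_t^i(b^i(\pi^{-i}), \pi^{-i}) - \varepsilon\right\}. \tag{4.1.1}$$
That is, $b_\varepsilon^i(\pi^{-i})$ is the set of player $i$'s strategies that perform no more
than $\varepsilon$ worse than her best response. The joint $\varepsilon$-best-response to the
mixed-strategy profile $\pi$ is defined as the set
$$b_\varepsilon(\pi) = b_\varepsilon^1(\pi^{-1}) \times \dots \times b_\varepsilon^i(\pi^{-i}) \times \dots \times b_\varepsilon^I(\pi^{-I}). \tag{4.1.2}$$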
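Definition 4.1.1. (GWFP process). A generalised weakened fictitious play process is any
process $\{\pi_t\}_{t \ge 1}$, with $\pi_t$ such that
$$\pi_{t+1} \in (1 - \alpha_{t+1})\pi_t + \alpha_{t+1}\left(b_{\varepsilon_t}(\pi_t) + M_{t+1}\right), \tag{4.1.3}$$
with $\alpha_t \to 0$ and $\varepsilon_t \to 0$ as $t \to \infty$,
$$\sum_{t \ge 1} \alpha_t = \infty,$$
and $\{M_t\}_{t \ge 1}$ a sequence of perturbations such that, for any $T > 0$,
$$\lim_{t \to \infty} \sup_k \left\{ \left\| \sum_{s=t}^{k-1} \alpha_{s+1} M_{s+1} \right\| : \sum_{s=t}^{k-1} \alpha_{s+1} \le T \right\} = 0.$$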
In other words, the current strategies are adapted towards a (possibly perturbed) joint
$\varepsilon$-best-response. Leslie and Collins (2006) showed that allowing non-zero
$\varepsilon_t$, letting $\alpha_t$ be chosen differently and allowing (certain) perturbations
do not affect the convergence result.
Lemma 4.1.1. (Leslie and Collins, 2006). The set of limit points of a generalised weakened
fictitious play process is a connected internally chain-recurrent set of the best response
differential inclusion.
They subsequently presented the following result.
Lemma 4.1.2. (Leslie and Collins, 2006). Any generalised weakened fictitious play
process will converge to the set of Nash equilibria in potential games.
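The simplest instance of a GWFP process is classical fictitious play, i.e. the recursion (4.1.3) with exact best responses, no perturbation and step size $1/(t+1)$. The sketch below runs it on a hypothetical two-player, two-action common-payoff game, which is a potential game. The payoff matrix, the initial strategies and the function names are assumptions chosen only for illustration.

```python
def best_response(payoff_rows, opponent_mix):
    """Pure best response, returned as a degenerate mixed strategy."""
    expected = [sum(p * u for p, u in zip(opponent_mix, row)) for row in payoff_rows]
    best = max(range(len(expected)), key=expected.__getitem__)
    return [1.0 if a == best else 0.0 for a in range(len(expected))]

def gwfp(payoff_rows, mix1, mix2, iterations=2000):
    """Iterate (4.1.3) with epsilon_t = 0, M_t = 0 and alpha_t = 1/(t+1)."""
    for t in range(1, iterations + 1):
        alpha = 1.0 / (t + 1)
        br1 = best_response(payoff_rows, mix2)
        br2 = best_response(payoff_rows, mix1)
        mix1 = [(1 - alpha) * p + alpha * b for p, b in zip(mix1, br1)]
        mix2 = [(1 - alpha) * p + alpha * b for p, b in zip(mix2, br2)]
    return mix1, mix2

# Hypothetical symmetric common-payoff game: both players receive
# COMMON_PAYOFF[a][b] when they play actions a and b.
COMMON_PAYOFF = [[2.0, 0.0], [0.0, 1.0]]
final1, final2 = gwfp(COMMON_PAYOFF, [0.2, 0.8], [0.2, 0.8])
print(final1, final2)
```

From this starting point both mixed strategies move towards the second action, and the limit is the pure Nash equilibrium in which both players choose it. Note that this is not the payoff-dominant equilibrium: the lemma guarantees convergence to the set of Nash equilibria, not to a particular one.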
4.2 The Boltzmann-Gibbs Actor-critic Algorithm
Actor-critic algorithms are normally used in cases where players only obtain payoffs for the
actions they have chosen, which is our definition of the naïve user case. However, due to the
complex nature of the dynamic traffic assignment simulation of PIUs with announced payoffs
that we consider, we use an actor-critic algorithm to estimate both the strategies and the
payoffs of each player.
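Definition 4.2.1. (Boltzmann-Gibbs actor-critic algorithm). A Boltzmann-Gibbs actor-critic
algorithm is a process $\{\pi_t, \mathcal{Q}_t\}$ such that
$$\begin{cases} \pi_t^i(a^i) = (1 - \alpha_t)\,\pi_{t-1}^i(a^i) + \alpha_t\,\beta_t^i(a^i) \\ \mathcal{Q}_t^i(a^i) = \mathcal{Q}_{t-1}^i(a^i) + \lambda_t \left(\mathcal{U}_t^i(a^i) - \mathcal{Q}_{t-1}^i(a^i)\right) \end{cases}, \quad \forall a^i \in \mathcal{A}^i,\ \forall i \in \mathcal{I}, \tag{4.2.1}$$
where $\beta_t^i(\mathcal{Q}_t^i)$ is defined by equation (3.2.9) and the temperature parameter
is updated according to the regret-based updating scheme,
$$\mu_t^i = \mu_{t-1}^i + \frac{1}{t}\left(\max([R]^+, 0) - \mu_{t-1}^i\right), \quad [R]^+ = \max_{k \in \mathcal{A}^i}\left(\mathcal{Q}_t^i(k) - \bar{\mathcal{U}}_t^i\right) > 0. \tag{4.2.2}$$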
We present our result without proof.
Proposition 4.2.2. Suppose that $\{\pi_t, \mathcal{Q}_t\}$ is a Boltzmann-Gibbs actor-critic
process for which,
1.) $\alpha_t = (C_\alpha + t)^{-\rho_\alpha}$ where $C_\alpha > 0$ and $\rho_\alpha \in\ ]0.5, 1]$, (4.2.3)
2.) $\lambda_t = (C_\lambda + t)^{-\rho_\lambda}$ where $C_\lambda > 0$ and $\rho_\lambda \in\ ]0.5, \rho_\alpha[$, (4.2.4)
3.) $\mu_t^i$ is calculated using equation (4.2.2).
Then with probability 1, the 𝜋𝑡 follow a generalised weakened fictitious play process.
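To make the interplay of the three updates concrete, the following is a minimal sketch of the process on a toy congestion game, not the SUMO-based implementation used in this paper. Several choices are assumptions made for illustration: actions are sampled from the mixed strategy, the average payoff in (4.2.2) is taken to be the running mean of the payoff a player actually received, the temperature is floored at a small positive value to avoid division by zero, and the linear travel time functions, the Gaussian noise and all names are hypothetical.

```python
import math
import random

def step_size(t, c, rho):
    """Learning rate (c + t) ** (-rho), the form used in (4.2.3) and (4.2.4)."""
    return (c + t) ** (-rho)

def boltzmann(q, mu, mu_floor=1e-6):
    """Boltzmann-Gibbs probabilities (3.2.9); mu_floor is an assumed safeguard."""
    mu = max(mu, mu_floor)
    top = max(q)
    weights = [math.exp((v - top) / mu) for v in q]
    total = sum(weights)
    return [w / total for w in weights]

class Player:
    def __init__(self, n_routes):
        self.pi = [1.0 / n_routes] * n_routes  # mixed strategy (actor)
        self.q = [0.0] * n_routes              # payoff estimates (critic)
        self.mu = 1.0                          # temperature
        self.mean_payoff = 0.0                 # running mean of received payoffs

    def choose(self, rng):
        return rng.choices(range(len(self.pi)), weights=self.pi)[0]

    def update(self, t, chosen, announced, alpha, lam):
        # Critic: update every route, since all payoffs are announced.
        self.q = [q + lam * (u - q) for q, u in zip(self.q, announced)]
        # Running mean of the received payoff (assumed meaning of the average payoff).
        self.mean_payoff += (announced[chosen] - self.mean_payoff) / t
        # Regret-based temperature update (4.2.2).
        regret = max(max(self.q) - self.mean_payoff, 0.0)
        self.mu += (regret - self.mu) / t
        # Actor: move the mixed strategy towards the smooth best response (4.2.1).
        beta = boltzmann(self.q, self.mu)
        self.pi = [(1 - alpha) * p + alpha * b for p, b in zip(self.pi, beta)]

def simulate(n_players=50, iterations=300, seed=1):
    rng = random.Random(seed)
    free_flow = [30.0, 40.0, 35.0]  # hypothetical free-flow times (s)
    slope = [1.0, 0.5, 0.8]         # hypothetical delay per vehicle (s)
    players = [Player(3) for _ in range(n_players)]
    flows = [0, 0, 0]
    for t in range(1, iterations + 1):
        alpha = step_size(t, 1.0, 1.0)  # rho_alpha = 1.0
        lam = step_size(t, 1.0, 0.6)    # rho_lambda = 0.6, between 0.5 and rho_alpha
        choices = [p.choose(rng) for p in players]
        flows = [choices.count(k) for k in range(3)]
        # Announced payoff: negative travel time plus noise. An unused route has
        # zero flow, so its (noisy) free-flow time is announced.
        announced = [-(free_flow[k] + slope[k] * flows[k]) + rng.gauss(0.0, 1.0)
                     for k in range(3)]
        for p, c in zip(players, choices):
            p.update(t, c, announced, alpha, lam)
    return players, flows

players, flows = simulate()
print(flows, players[0].pi, players[0].mu)
```

The payoff step size decays more slowly than the strategy step size, so the payoff estimates adapt on a faster timescale than the mixed strategies, as required by conditions (4.2.3) and (4.2.4).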
The regret-based temperature updating scheme is used because it reduces the number of
exogenous variables in the model. A player's regret is directly connected to her strategy
selection and payoffs, on which an improving action selection policy should depend. This is
more logical than the difference between the maximum and minimum estimates used by Leslie and
Collins (2006), which requires the exogenous variable $\rho_\pi$,
$$\mu_t^i = \frac{\max_{k \in \mathcal{A}^i} \mathcal{Q}_t^i(k) - \min_{k \in \mathcal{A}^i} \mathcal{Q}_t^i(k)}{\rho_\pi \log t}. \tag{4.2.5}$$
Additionally, since we are dealing with PIU with announced payoffs, the action counts in the
payoff learning rates of the players used in Leslie and Collins (2006),
$$\lambda_t = \left(C_\lambda + \#_t^i(a_t^i)\right)^{-\rho_\lambda}, \quad \#_t^i(a_t^i) = \sum_t \mathbb{I}\{a_t^i = a^i\}, \tag{4.2.6}$$
are replaced by the iteration count. The action counts act as a kind of unbiased estimator
(Leslie and Collins, 2005) that corrects for the infrequent updates of the values of actions
with low probabilities. They can be viewed as a player's way of compensating for the fact that
actions played infrequently do not receive updates of their values, so when they are played,
any reward prediction error must have greater influence on the value than if frequent updates
occur. In the PIU with announced payoffs scenario, however, all action values are updated at
each stage, so no such estimator is needed.
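Using the results of Singh et al. (2000) and Leslie and Collins (2006), the goal is to show
that $\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$ and $\mu_t^i \to 0$ as $t \to \infty$.
We can rewrite equation (4.2.2) as
$$\mu_t^i - \mu_{t-1}^i\left(1 - \frac{1}{t}\right) = \frac{\max\left[\max_{k \in \mathcal{A}^i}\left(\mathcal{Q}_t^i(k) - \bar{\mathcal{U}}_t^i\right), 0\right]}{t}. \tag{4.2.7}$$
The term on the right-hand side goes to zero almost surely as $t \to \infty$ if
$\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$, which makes $\mu_t^i \to 0$. So we only
need to show that $\|\mathcal{Q}_t^i - \mathcal{U}_t^i(\pi^i)\| \to 0$ almost surely as
$t \to \infty$, which we show through our simulation result.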
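The step from a vanishing regret to a vanishing temperature can be made explicit by unrolling the recursion in equation (4.2.2). The following derivation is a sketch under the assumption that the recursion starts at $t = 1$, so that the initial value $\mu_0^i$ receives zero weight; the shorthand $r_t^i$ for the non-negative regret at stage $t$ is introduced here only for convenience.

```latex
\begin{align*}
t\,\mu_t^i &= (t-1)\,\mu_{t-1}^i + r_t^i,
  \qquad r_t^i := \max\bigl([R]^+, 0\bigr) \text{ evaluated at stage } t, \\
t\,\mu_t^i &= \sum_{s=1}^{t} r_s^i
  \quad\Longrightarrow\quad
  \mu_t^i = \frac{1}{t}\sum_{s=1}^{t} r_s^i .
\end{align*}
```

The temperature is therefore the running average of the past non-negative regrets. If the regret $r_t^i$ tends to zero, its running average tends to zero as well (Cesàro mean), which gives $\mu_t^i \to 0$.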
5. SIMULATION RESULTS
We present a simulation of the transportation network shown in figure 5.1. We assume that
each player has the same set of actions (i.e. the set of available routes),
$\mathcal{A}^i = \{1,2,3\}, \forall i \in \mathcal{I}$. We assigned 1000 players to the
network, which is composed of a single origin-destination (OD) pair with 3 routes, i.e.,
$I = 1000$. Flow conservation is described by the equation
$I = \sum_{k \in \mathcal{K}} h(k)$, $\mathcal{K} = \{1,2,3\}$.
Figure 5.1. The test network
Table 5.1. Link segment settings
Link segment   Length (m)   Maximum allowed speed (m/s)   Number of lanes
1              500          13.89                         2
The simulation-based dynamic traffic assignment is carried out using the Simulation of
Urban MObility (SUMO) software. SUMO is a free and open traffic simulation suite which has
been available since 2001. SUMO allows modelling of intermodal traffic systems including
road vehicles, public transport and pedestrians.
In the simulation, players use equations (4.2.1)-(4.2.2) to update their route choices
and payoff estimates. Each of the 1000 iterations covers 3600 simulation seconds, and each
player has a player-specific, Poisson-distributed, dynamic departure time. We assume that
speed, flow and density are collected by sensors positioned throughout the links. Travel
times on all routes are announced to all the players in the network; for an unused route,
the free-flow travel time is announced.
The intersection formed by links 4 and 6 is priority-based, with link 4 having priority.
This means that vehicles traversing link 6 wait for a gap in link 4 before they can enter
link 5. The same occurs at the intersection between links 3 and 5, where link 3 has priority.
The legal speed limit on each link is 13.89 m/s. We assume that all vehicles accelerate at
0.8 m/s² and decelerate at 4.5 m/s². The maximum speed of a vehicle is assumed to be 70 m/s
(achievable speed of the vehicle's engine). Each player has an imperfection coefficient
(sigma), which is a braking probability that we set to 0.05. To ensure variable vehicle speeds
at each time step, we set the speed deviation parameter to 0.1, which results in a speed
distribution where 95% of the vehicles drive between 80% and 120% of the legal speed limit.
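The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the speed factor is normally distributed with mean 1.0 and deviation 0.1, computes the free-flow travel time of link segment 1 from Table 5.1, and assembles the stated vehicle parameters as a SUMO `vType` element. The attribute names `accel`, `decel`, `sigma`, `maxSpeed` and `speedDev` are standard SUMO vehicle type attributes, but the identifier `driver` is hypothetical and the actual configuration files used in this study are not reproduced here.

```python
import math
import xml.etree.ElementTree as ET

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Share of speed factors between 0.8 and 1.2 (two deviations around the mean),
# assuming a normal distribution with mean 1.0 and deviation 0.1.
share = normal_cdf(1.2, 1.0, 0.1) - normal_cdf(0.8, 1.0, 0.1)

# Free-flow travel time (s) on link segment 1: length / maximum allowed speed.
free_flow_time = 500.0 / 13.89

# Vehicle type carrying the parameters stated in the text.
vtype = ET.Element("vType", {
    "id": "driver",       # hypothetical identifier
    "accel": "0.8",       # m/s^2
    "decel": "4.5",       # m/s^2
    "sigma": "0.05",      # driver imperfection
    "maxSpeed": "70",     # m/s
    "speedDev": "0.1",    # deviation of the speed factor
})

print(round(share, 4), round(free_flow_time, 1))
print(ET.tostring(vtype, encoding="unicode"))
```

The computed share is about 95%, in line with the statement above, and the free-flow travel time of the 500 m segment is about 36 s.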
Figure 5.2. Fundamental diagram of the first iteration
Figure 5.2 shows the relationships of the speed, flow and density which compose the
fundamental diagram of traffic flow used to predict the capability of a road system or its
behavior when applying inflow regulation or speed limits. The upper-right figure shows the
speed-density relationship with a negative linear slope which means that as the density increases,
the speed on the link decreases. The line that crosses the speed axis is at the free flow speed
while the line that crosses the density axis is at jam density. The figure shows that the speed
approaches free flow speed as the density approaches zero. As the density increases, the speed
of the vehicles on the links decreases and it reaches zero when the density equals the jam density.
However, link 3 has a positive slope because it is fed by a single-lane road segment (link 2)
and, being a two-lane road segment, distributes the incoming vehicles across its lanes. It
therefore does not become as congested as the other links in the network.
The flow-density relationship in the lower-right of figure 5.2 follows a triangular shaped
curve which is approximated by a parabolic curve. However, this is inverted in the density axis.
Normally, the flow-density graph is represented by two vectors representing the free flow
velocity (negative slope in the figure) and the congested branch (positive slope in the figure).
The congested branch implies that even though there are more vehicles on the road, the number
of vehicles passing a single point is less than if there were fewer vehicles on the road.
Flow on links 3 and 7 is almost unaffected by the increase in density for two reasons: 1.)
route 1, which link 3 belongs to, is the priority route at the intersection of links 3 and 5,
so vehicles using link 3 do not stop to give way to vehicles on link 5, and 2.) the flow of
vehicles comes from link 2, a single-lane link, transferring to link 3, a dual-lane link.
The speed-flow diagram in the upper-left of figure 5.2, which consists of the free flow and
congested branches, is used to determine the speed at which maximum flow occurs. There is
currently no function that approximates it; however, the linear approximations (looking from
left to right) show that the average speed decreases as the average flow decreases, implying
that this is in the congested branch of the speed-flow diagram. Additionally, the
approximations on links 3 and 7 show that these links are almost at optimum flows for the same
reasons stated above.
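The shapes described above (a linear speed-density relation and a parabolic flow-density relation) correspond to the classical Greenshields model, which can serve as a reference when reading the diagrams. The sketch below is an illustration of that textbook model only, not of the relations fitted to the simulation output; the jam density value is hypothetical, while the free-flow speed is set to the legal speed limit of the simulation.

```python
def greenshields_speed(density, free_flow_speed, jam_density):
    """Linear speed-density relation: v = v_f * (1 - k / k_j)."""
    return free_flow_speed * (1.0 - density / jam_density)

def flow(density, free_flow_speed, jam_density):
    """Flow as density times speed, which is parabolic in the density."""
    return density * greenshields_speed(density, free_flow_speed, jam_density)

FREE_FLOW_SPEED = 13.89  # m/s, the legal speed limit in the simulation
JAM_DENSITY = 0.15       # veh/m per lane, hypothetical value

# Under this model the maximum flow occurs at half the jam density.
critical_density = JAM_DENSITY / 2.0
capacity = flow(critical_density, FREE_FLOW_SPEED, JAM_DENSITY)

for k in (0.0, 0.03, critical_density, 0.12, JAM_DENSITY):
    v = greenshields_speed(k, FREE_FLOW_SPEED, JAM_DENSITY)
    q = flow(k, FREE_FLOW_SPEED, JAM_DENSITY)
    print(f"density={k:.3f} veh/m  speed={v:.2f} m/s  flow={q:.3f} veh/s")
```

Densities below the critical density lie on the free flow branch, where flow increases with density, and densities above it lie on the congested branch, where adding vehicles reduces the flow.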
Figure 5.3. Fundamental diagram of the last iteration
Comparing the fundamental diagrams of figure 5.2 and figure 5.3 shows that the players have
learned to avoid long travel times. Link 1 is slightly congested because vehicles can change
lanes and are inserted into the network randomly between the two lanes. This means that when a
vehicle that is set to use route 3 is inserted in the upper lane, it needs to wait for a gap
in the lower lane to be able to go to link 4. This random insertion causes slight congestion
on this link. In the speed-flow diagram of figure 5.3, link 5 has a negative slope, which is
caused not by congestion on link 5 but by a long waiting time due to the priorities in the
merging links. Link 5 belongs to route 3, which has a lower priority than link 3, which belongs
to route 1. This causes vehicles using link 5 to wait for a gap in order to move to link 7.
Lastly, no vehicle uses link 6, which belongs to route 2, as this route has a very high travel
time. A vehicle using route 2 has to wait for a gap at two intersections because of lower link
priorities: at the intersection between links 4 and 6, where link 4 has the higher priority,
and again at the intersection between links 3 and 5, where link 3 has the higher priority.
Figure 5.4 below shows that vehicles immediately realize that route 1 is the best route
choice since link 3 has priority at the intersection of links 3 and 5. However, the
fluctuations that can be observed in the mean route travel time figure in the middle are caused
by vehicles from route 1 changing to routes 2 or 3. Vehicles that have faster speeds are
limited by the speed of the vehicles ahead of them in link 2, making the travel time on this
specific iteration higher for this route. Therefore, probabilities for this route decrease in
the next iteration.
Figure 5.4. Link and route information
In figure 5.5 below, the strategy and payoff learning parameters, $\alpha$ and $\lambda$,
respectively, are shown to be slowly decreasing to zero as time progresses, as required by our
result. The temperature parameter, $\mu$, appears to be decreasing to zero, which is required
to show convergence to a generalised weakened fictitious play process. More importantly, the
figure shows that the temperature parameter is player-specific, which implies that players are
learning and updating their strategies independently of each other, validating the multi-agent
model, and that as this parameter decreases, the probability of choosing the best action
increases. This validates the actor-critic algorithm as a learning model where players'
strategy selection improves with experience.
The top figure in figure 5.6 below shows the averaged route probabilities of the selected
routes of all players. This figure is almost identical to the route counts figure (bottom of
figure 5.4) because both represent the choice distributions of the players. The only
difference is that these are the 'real' route probabilities of the players for selecting the
action (i.e. mixed strategies), which makes them slightly lower than in the route counts
figure. If these were based on pure strategies (probability 1), the figure would be identical
to the route counts figure. The middle and bottom figures show that as time progresses, the
distance between the estimated payoffs (-239.5752278) and the average payoff (-239.7410205)
approaches zero, which is necessary to show equivalence to the case wherein players can
actually observe the other players' actions and is a requirement for convergence to a
generalised weakened fictitious play process. Significantly, even though the information that
the players receive is not very accurate (mean route travel time shown in the middle figure in
figure 5.4), the estimates of the players' payoffs and strategies still converge. Furthermore,
the simulation has been carried out more than 50 times and we obtain the same consistent
result (approximately -239.5) within a reasonable number of iterations (i.e. after only 200
iterations with 1000 players).

Figure 5.5. Learning parameters
Figure 5.6. Link and route information
6. CONCLUSION
This paper further developed the stochastic congestion game model proposed by Miyagi and
Peque (2012) by applying it to a simulation-based dynamic traffic assignment with PIU with
announced payoffs. Our motivation is the application of such a model to a transportation
network where a Traffic Management Center (TMC) is present which announces travel times to all
drivers. This scenario is typical in a transportation network where Intelligent Transportation
Systems (ITS) are utilized. Examples of such a scenario are the availability of the Vehicle
Information and Communication System (VICS) technology of Japan or the Traffic Message Channel
(TMC) technology of Europe.
The motivation for proposing the game theoretical model was the lack of behavioral
realism inherent in the traditional equilibrium models such as the UE and SUE. This led to the
discretization of demand, where users are treated as individual decision-makers, making it a
multi-agent model. Since Wardrop equilibrium is applicable only to the case where players are
non-atomic, Nash equilibrium is used, which preserves its mathematical interpretation.
The authors (Miyagi and Peque, 2012) have shown that Nash equilibrium can be achieved
in two of the three classes of players they have defined, namely, the PIU with anticipated
payoffs (Miyagi and Peque, 2012) and naïve users (Miyagi et al., 2013). However, the results
were only shown for the static traffic assignment setting in both papers. Hence, to validate their
model, we tackled the case where players are PIU with announced payoffs under a simulation-
based dynamic traffic assignment simulation. Moreover, players have a player-specific,
Poisson-distributed, dynamic departure time making the problem highly complex. To solve this,
players in the transportation network are learning and updating their strategy and payoff
estimates using an actor-critic algorithm proposed by Leslie and Collins (2006) which we
slightly modified to fit the scenario. Regardless, we obtain the expected results they have
presented which consequently validates the efficacy of the game theoretical model. As an
additional consequence, we are able to analyze the evolution of the players’ route choices and
behaviors by learning how to use the transportation network. Finally, the simulation shows that
even when the players receive information with noise, convergence to Nash equilibrium is
achieved almost surely within a reasonable iteration interval.
Although the current simulation results are restrictive, they are significant. They show that
the multi-agent model is capable of including player-specific attributes in the traffic
simulation and the route choice model simultaneously, and that it is likely to have the global
convergence property inherent in the usual traffic environment.
Our next step is to apply our methods to a more sophisticated network (i.e. a larger
network, a network with traffic lights and loop detectors, etc.). Furthermore, we are interested
in extending the simulation to the naïve user setting. The naïve user setting closely resembles
the assumptions currently used in micro-traffic simulation models and is a more realistic and
plausible model of current transportation networks.
ACKNOWLEDGEMENT
This research is supported by MEXT Grants-in-Aid for Scientific Research, No. 26420511, for
the term 2014-2016.
REFERENCES
Borkar, V. (2008) Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge
University Press.
Chapman, A., Leslie, D., Rogers, A., Jennings, N. (2013) Convergent Learning Algorithms for
Unknown Reward Games. SIAM Journal on Control and Optimization 2013 51:4, 3154-3180.
Cominetti, R., Melo, E., Sorin, S. (2010) A payoff-based learning procedure and its application
to traffic games. Games and Economic Behavior, 70, pp.71-83.
Daganzo, C., Sheffi, Y. (1977) On stochastic models of traffic assignment. Transpn. Sci. 11,
253-274.
Fabrikant, A., Jaggard, A., Schapira, M. (2013) On the Structure of Weakly Acyclic Games.
Theory of Computing Systems 53, 107-122.
Fudenberg, D., Levine, D. (1998) The Theory of Learning in Games. The MIT Press, Cambridge,
MA, USA.
Hart, S., Mas-Colell, A. (2000) A simple adaptive procedure leading to a correlated equilibrium.
Econometrica, 68:1127-1150.
Hofbauer, J., Sandholm, W. (2002) On the global convergence of stochastic fictitious play.
Econometrica 70, 2265-2294.
Leslie, D., Collins, E. (2003) Convergent multiple-timescales reinforcement learning algorithm
in normal form games. Ann. Appl. Probab., 13, pp. 1231-1251.
Leslie, D., Collins E. (2005) Individual Q-learning in normal form games. SIAM J. Control
Optim, 44(2), pp. 495-514.
Leslie, D., Collins E. (2006) Generalised weakened fictitious play. Games and Economic
Behavior, 56:285–298.
Marden, J., Young, P., Arslan, G., Shamma, J. (2009) Payoff-based dynamics for multiplayer
weakly acyclic games. SIAM J. Control and Optimization, 48(1).
McFadden, D. (1974) Conditional logit analysis of qualitative choice-behavior. In Zarembka P,
(Ed.) Frontiers in econometrics, Academic Press, New York.
Miyagi, T. (1983) Dual approach to the modal equilibrium problem. Technical Report, No. 83-
TE-MT3-8, Dept. of Civil Engineering, Gifu University.
Miyagi, T. (2005) Stochastic fictitious play, reinforcement learning and the user equilibrium in
transportation networks. A paper presented at the IVth meeting on "Mathematics in
Transport", University College London.
Miyagi, T., Peque, G. (2012) Informed user algorithm that converge to a pure Nash equilibrium
in traffic games. Procedia- Social and Behavioral Sciences, Volume 54, 4 October, pp. 438–
449.
Miyagi, T., Peque, G., Fukumoto, J. (2013) Adaptive Learning Algorithms for Traffic Games
with Naive Users. Procedia - Social and Behavioral Sciences, Volume 80, 7 June, Pages 806-
817.
Miyagi, T., Ohno, E., Morisugi, H. (1991) A fixed point algorithm for solving the traffic
equilibria. Studies of Regional Sciences, No. 21, pp. 229-246.
Monderer, D., Shapley, L. (1996) Potential games. Games and Economic Behavior, 14:124–
143.
Nagel, K., Flotterod, G. (2012) Agent-based traffic assignment: going from trips to behavioral
travelers. in R. Pendyala and C. Bhat (eds), Travel Behaviour Research in an Evolving World,
Emerald Group Publishing, Bingley, UK, pp. 261-293.
Nonoyama, H., Miyagi, T. (1982) A fixed point approach to the supply-demand equilibrium
problem in traffic network. Proc. of Infrastructure Planning.
Robbins, H., Monro, S. (1951) A Stochastic Approximation Method. The Annals of
Mathematical Statistics 22 (3): 400.
Rosenthal, R. (1973) A class of games possessing pure-strategy Nash equilibria. International
Journal of Game Theory 2: 65–67.
Selten, R., Schreckenberg, M., Chmura, T., Pitz, T., Kube, S., Hafstein, S., Chrobok, R.,
Pottmeier, A., Wahle, J. (2004) Experimental investigation of day-to-day route-choice
behaviour and network simulations of autobahn traffic in North Rhine-Westphalia. In:
Schreckenberg A, Selten R, editors, Human Behaviour and Traffic Networks. Springer,
Berlin Heidelberg, pp. 1-21.
Singh, S., Jaakkola, T., Littman, M., Szepesvari, C. (2000) Convergence results for single-step
on-policy reinforcement-learning algorithms. Machine Learning 38, 287-308.
Tadelis, S. (2012) Game Theory: An Introduction. Economics Books, Princeton University
Press, edition 1, volume 1, number 10001.
Van der Genugten, B. (2000) A weakened form of fictitious play in two-person zero-sum games.
Int. Game Theory Rev. 2, 307-328.
Wardrop, J. (1952) Some theoretical aspects of road traffic research. In Proceedings of the
Institution of Civil Engineers, Part II, pp. 325-378.
