I implemented a multi-agent reinforcement learning algorithm for online constrained optimization problems, using Java and R. Several agents had to agree on a common solution to the optimization problem, and they had to find the cooperation network most beneficial to group performance.
Local Coordination in Online Distributed Constraint Optimization Problems
Antonio Maria Fiscarelli¹, Robert Vanden Eynde¹ and Erman Loci²
¹ École Polytechnique, Université Libre de Bruxelles, Avenue Franklin Roosevelt 50, 1050 Bruxelles
² Artificial Intelligence Lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
afiscare@ulb.ac.be
Abstract
For agents to achieve a common goal in multi-agent systems, they often need to coordinate. One way to achieve coordination is to let agents learn in a joint action space. Joint Action Learning allows agents to take the actions of other agents into account, but the joint action space grows exponentially with the number of agents. If coordination between some agents is more important than between others, local coordination allows agents to coordinate while keeping the complexity low. In this paper we investigate local coordination in which agents learn the problem structure, resulting in better group performance.
Introduction
In multi-agent systems, agents must coordinate to achieve a jointly optimal payoff. One way to achieve this coordination is to let agents observe the actions chosen by other agents and, based on those actions, choose an action that increases the total payoff. This method of learning is called Joint Action Learning (JAL). As the number of agents in JAL increases, the joint space in which the agents learn grows exponentially (Claus & Boutilier, 1998), because each agent has to observe the actions of every other agent. Even though JALs find the optimal solutions, they are computationally expensive. In this paper we introduce a method, Local Joint Action Learning (LJAL), that addresses this complexity problem by sacrificing some solution quality. We let agents observe the actions of only some of the other agents: those that are important, or those that are necessary.
Local Joint Action Learners
The LJAL approach relies on the concept of a Coordination Graph (CG) (Guestrin, Lagoudakis, & Parr, 2002). A CG describes action dependencies between agents: vertices represent agents, and edges represent coordination between those agents (Fig. 1).
Fig. 1. An example of a Coordination Graph (CG).
In LJAL the learning problem can be described as a distributed n-armed bandit problem, where every agent chooses among n actions and the reward depends on the combination of all chosen actions. Agents estimate action rewards according to the following update rule (Sutton & Barto, 1998):

$$Q_{t+1}(a) = Q_t(a) + \alpha \left[ r_{t+1} - Q_t(a) \right]$$
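This update can be implemented directly. A minimal Java sketch, assuming a fixed learning rate α (class and method names are illustrative, not from the paper):

```java
/** Running reward estimate for one action, updated as Q <- Q + alpha * (r - Q). */
public final class ActionEstimate {
    private double q = 0.0;        // current estimate Q_t(a)
    private final double alpha;    // learning rate

    public ActionEstimate(double alpha) { this.alpha = alpha; }

    /** Move the estimate toward the newly observed reward r_{t+1}. */
    public void update(double reward) {
        q += alpha * (reward - q);
    }

    public double value() { return q; }
}
```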
LJALs also keep a probabilistic model of the action selection of the other agents: they count the number of times $C$ each action has been chosen by each agent. Agent $i$ maintains the frequency $F^i_{a_j}$ with which agent $j$ selects action $a_j$ from its action set $A_j$:

$$F^i_{a_j} = \frac{C^i_{a_j}}{\sum_{a \in A_j} C^i_{a}}$$

The expected value for agent $i$ of selecting a specific action $a_i$ is calculated as follows:

$$EV(a_i) = \sum_{a \in A^i} Q(a \cup \{a_i\}) \prod_{j \in N(i)} F^i[a_j]$$

where $A^i = \times_{j \in N(i)} A_j$ and $N(i)$ represents the set of neighbors of agent $i$ in the CG.
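A hedged Java sketch of this computation, enumerating the joint actions of the neighbors (the storage of Q-values and all names are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

/** Sketch of agent i's expected-value computation over its CG neighbors. */
public final class ExpectedValue {

    /** Empirical frequency F^i_{a_j}: counts[a] = times neighbor j chose a. */
    static double[] frequencies(int[] counts) {
        double total = Arrays.stream(counts).sum();
        double[] f = new double[counts.length];
        for (int a = 0; a < f.length; a++)
            f[a] = total == 0 ? 1.0 / f.length : counts[a] / total;
        return f;
    }

    /**
     * EV(ownAction) = sum over neighbor joint actions a of
     * Q(a plus ownAction) weighted by the product of neighbor frequencies.
     * q scores an int[] holding the neighbors' actions followed by our own.
     */
    static double ev(int ownAction, int[][] neighborCounts, int actionsPerAgent,
                     ToDoubleFunction<int[]> q) {
        int n = neighborCounts.length;
        double[][] freq = new double[n][];
        for (int j = 0; j < n; j++) freq[j] = frequencies(neighborCounts[j]);

        double ev = 0.0;
        int combos = (int) Math.pow(actionsPerAgent, n);
        for (int code = 0; code < combos; code++) {
            int c = code;
            int[] joint = new int[n + 1];
            double p = 1.0;                     // probability of this joint action
            for (int j = 0; j < n; j++) {
                joint[j] = c % actionsPerAgent; // decode mixed-radix joint action
                c /= actionsPerAgent;
                p *= freq[j][joint[j]];
            }
            joint[n] = ownAction;
            ev += q.applyAsDouble(joint) * p;   // weight Q by the joint probability
        }
        return ev;
    }
}
```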
Following Sutton and Barto (1998), the probability that agent $i$ chooses action $a_i$ at time $t$ is given by the Boltzmann distribution:

$$\Pr(a_i) = \frac{e^{EV(a_i)/\tau}}{\sum_{j=1}^{n} e^{EV(a_j)/\tau}}$$

The temperature parameter $\tau$ controls how greedily actions are selected.
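A minimal sketch of this selection rule in Java (the max-subtraction is a standard numerical-stability trick, not part of the paper's formula):

```java
import java.util.Random;

/** Boltzmann (softmax) action selection over the expected values EV(a). */
public final class BoltzmannSelection {
    private static final Random RNG = new Random();

    /** Sample an action index with probability proportional to e^{EV/tau}. */
    static int select(double[] ev, double tau) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : ev) max = Math.max(max, v);
        double[] w = new double[ev.length];
        double sum = 0.0;
        for (int a = 0; a < ev.length; a++) {
            // subtracting max does not change the distribution
            w[a] = Math.exp((ev[a] - max) / tau);
            sum += w[a];
        }
        double u = RNG.nextDouble() * sum;
        for (int a = 0; a < w.length; a++) {
            u -= w[a];
            if (u <= 0) return a;
        }
        return w.length - 1;   // guard against floating-point rounding
    }
}
```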
LJAL performance
We compare IL (independent learners), LJALs with a randomly generated CG of out-degree 1 per agent (LJAL-1), LJALs with a randomly generated CG of out-degree 2 (LJAL-2), and LJALs with a randomly generated CG of out-degree 3 (LJAL-3). These were evaluated on randomly generated distributed bandit problems: for every possible joint action, a fixed global reward is drawn from a normal distribution $N(0, 70)$ (70 = 10 × the number of agents). A single run of the experiment consists of 200 plays, in which 7 agents choose among 4 actions and receive the reward for the global joint action as determined by the problem. In every run the LJALs get a new random graph with the corresponding out-degree. Agents select their actions with temperature $\tau = 1000 \cdot 0.94^{play}$. The experiment is averaged over 200 runs (Fig. 2).
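To make the setup concrete, a hedged sketch of the random graph generation and the temperature schedule (the paper does not specify its generator; names are illustrative):

```java
import java.util.Random;

/** Sketch of the experimental setup: a random CG with fixed out-degree
 *  and the decaying temperature tau = 1000 * 0.94^play. */
public final class ExperimentSetup {
    static final Random RNG = new Random();

    /** graph[i] lists the distinct neighbors agent i observes
     *  (assumes outDegree < numAgents). */
    static int[][] randomCoordinationGraph(int numAgents, int outDegree) {
        int[][] graph = new int[numAgents][outDegree];
        for (int i = 0; i < numAgents; i++) {
            boolean[] used = new boolean[numAgents];
            used[i] = true;                    // no self-edges
            for (int e = 0; e < outDegree; e++) {
                int j;
                do { j = RNG.nextInt(numAgents); } while (used[j]);
                used[j] = true;
                graph[i][e] = j;
            }
        }
        return graph;
    }

    static double temperature(int play) {
        return 1000.0 * Math.pow(0.94, play);
    }
}
```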
Fig. 2. Comparison of IL and LJALs.
We can see that the solution quality of IL is the worst, while with more coordination the reward improves. This happens because ILs reason only about their own actions, while LJALs take the actions of other agents into consideration. LJALs achieve better solution quality, but their complexity also increases.
Distributed Constraint Optimization
A Constraint Optimization Problem (COP) describes the problem of assigning values to a set of variables, subject to a number of soft constraints. Solving a COP means maximizing the sum of the rewards associated with the constraints under the chosen variable assignment. A Distributed Constraint Optimization Problem (DCOP) is a tuple (A, X, D, C, f), where:
$A = \{a_1, a_2, \ldots, a_n\}$ is the set of agents;
$X = \{x_1, x_2, \ldots, x_m\}$ is the set of variables;
$D = \{D_1, D_2, \ldots, D_m\}$ is the set of domains; variable $x_i$ can be assigned values from the finite domain $D_i$;
$C = \{c_1, c_2, \ldots, c_p\}$ is the set of constraints; constraint $c_i$ is a function $D_a \times D_b \times \ldots \times D_k \to \mathbb{R}$, with $\{a, b, \ldots, k\} \subseteq \{1, 2, \ldots, m\}$, projecting the domains of a subset of variables onto a real number, the reward;
$f: X \to A$ is a function mapping each variable onto a single agent.
The total reward of a variable assignment S, assigning value $v(x_i) \in D_i$ to each variable $x_i$, is:

$$C(S) = \sum_{i=1}^{p} c_i\big(v(x_a), v(x_b), \ldots, v(x_k)\big)$$
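A small Java sketch of evaluating $C(S)$ (the constraint representation is an assumption for illustration, not from the paper):

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of evaluating a DCOP assignment: each soft constraint scores
 *  the values of the subset of variables in its scope. */
public final class DcopEvaluation {

    /** A soft constraint over the variables whose indices are in scope. */
    record Constraint(int[] scope, ToDoubleFunction<int[]> reward) {}

    /** C(S) = sum over constraints of c_i applied to its scope's values. */
    static double totalReward(int[] assignment, List<Constraint> constraints) {
        double total = 0.0;
        for (Constraint c : constraints) {
            int[] values = new int[c.scope().length];
            for (int k = 0; k < values.length; k++)
                values[k] = assignment[c.scope()[k]];   // project S onto the scope
            total += c.reward().applyAsDouble(values);
        }
        return total;
    }
}
```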
DCOPs are used to model a variety of real-world problems, ranging from disaster response scenarios (Chapman et al., 2009) and distributed sensor network management (Kho, Rogers, & Jennings, 2009) to aircraft scheduling (Van Leeuwen, Hesselink, & Rohling, 2002).
In a DCOP each constraint has its own reward function, and since the total reward for a solution is the sum of all rewards, some constraints can have a larger impact on the solution quality than others. Coordination between specific agents can therefore be more important than between others. We investigate the performance of LJALs on DCOPs where some constraints are more important than others. We generate random, fully connected DCOPs, drawing the rewards of every constraint function from different normal distributions. We attach a weight $w_i \in [0, 1]$ to each constraint $c_i$; the problem's variance $\sigma$ is multiplied by this weight when the reward function for constraint $c_i$ is generated. The rewards for constraint $c_i$ are thus drawn from the distribution:

$$N(0, \sigma w_i)$$
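A hedged sketch of this reward generation for a binary constraint (treating $\sigma w_i$ as the scale passed to a Gaussian sampler; names are illustrative):

```java
import java.util.Random;

/** Sketch of drawing a fixed reward table from N(0, sigma * w) for a
 *  binary constraint: one reward per combination of the two values. */
public final class WeightedRewards {
    static final Random RNG = new Random();

    static double[][] rewardTable(int domainSize, double sigma, double weight) {
        double[][] table = new double[domainSize][domainSize];
        for (int a = 0; a < domainSize; a++)
            for (int b = 0; b < domainSize; b++)
                table[a][b] = RNG.nextGaussian() * sigma * weight;
        return table;
    }
}
```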
In our experiment we compare different LJALs solving the structure given in Fig. 3. The black edges in Fig. 3 correspond to weights of 0.9, while the gray edges correspond to weights of 0.1.
Fig. 3. A weighted CG; darker edges mean more important constraints, lighter edges mean less important constraints.
In addition to IL and LJAL with random out-degree 2 (LJAL-1), we compare an LJAL with a CG matching the problem structure (LJAL-2), and another LJAL with the same structure as the problem but with an added edge between agents 1 and 5 (LJAL-3). From the results shown below (Fig. 4) we can see that LJAL-2 performs better than LJAL-1, meaning that an LJAL with a CG that corresponds to the problem structure gives better solutions than an LJAL with randomly generated CGs. We can also see that the added coordination between agents 1 and 5 in LJAL-3 does not improve the solution quality. This happens because the extra information on an unimportant constraint complicates the coordination on important constraints. According to Taylor et al. (2011), an increase in teamwork is not necessarily beneficial to solution quality.
Fig. 4. Comparing IL and LJAL on a distributed constraint
optimization problem.
We run another experiment to test the effect an extra coordination edge has on solution quality. We modify LJAL-3 by adding an extra coordination edge between agents 4 and 7 and removing the edge between agents 1 and 5 (Fig. 5). We can see that the extra coordination between agents 4 and 7 improves the solution quality, because agents 4 and 7 are not involved in any important constraint.
Fig. 5. The effect of an extra coordination edge on solution quality
Learning Coordination Graphs
In the previous experiment we showed that LJALs with the same CG as the problem structure perform better than LJALs with randomly generated CGs. In the next experiment we make the LJALs learn the optimal CG.
The problem of learning a CG is encoded as a distributed n-armed bandit problem. Each agent can choose at most one or two coordination partners. We map the two-partner selection to an n-armed bandit problem by making actions represent pairs of agents instead of single agents. The coordination partners are chosen randomly, and after they are chosen, the LJALs solve the learning problem using that graph. The resulting reward is used as feedback for choosing the next coordination partners. This constitutes one play at the meta-learning level. The process is repeated until the CG converges. The agents in this meta-bandit problem are independent learners.
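A hedged Java sketch of this meta-learning loop for the single-partner case (the inner LJAL solver is assumed and not shown; the two-partner variant would use actions representing pairs of agents instead):

```java
import java.util.Random;

/** Sketch of the meta-bandit: each agent independently learns which
 *  coordination partner to pick, using the LJAL reward as feedback. */
public final class MetaBandit {
    static final Random RNG = new Random();

    interface LjalSolver { double runLjal(int[][] coordinationGraph); }

    static int[][] learn(int numAgents, int metaPlays, double alpha,
                         LjalSolver solver) {
        // q[i][j] = agent i's estimate of the value of coordinating with j
        double[][] q = new double[numAgents][numAgents];
        int[][] graph = new int[numAgents][1];
        for (int play = 0; play < metaPlays; play++) {
            double tau = 1000.0 * Math.pow(0.994, play);
            for (int i = 0; i < numAgents; i++)
                graph[i][0] = boltzmann(q[i], i, tau);   // pick one partner
            double reward = solver.runLjal(graph);       // evaluate the chosen CG
            for (int i = 0; i < numAgents; i++) {
                int j = graph[i][0];
                q[i][j] += alpha * (reward - q[i][j]);   // same update rule as above
            }
        }
        return graph;
    }

    /** Softmax over q, excluding self-coordination. */
    static int boltzmann(double[] q, int self, double tau) {
        double[] w = new double[q.length];
        double sum = 0.0;
        for (int j = 0; j < q.length; j++) {
            w[j] = (j == self) ? 0.0 : Math.exp(q[j] / tau);
            sum += w[j];
        }
        double u = RNG.nextDouble() * sum;
        for (int j = 0; j < w.length; j++) {
            u -= w[j];
            if (u <= 0 && j != self) return j;
        }
        return self == q.length - 1 ? 0 : q.length - 1;  // rounding guard
    }
}
```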
In our experiment we make the agents learn a CG for the problem structure proposed in Fig. 3. This way we can compare the learned CG with the known problem structure. One meta-bandit run consists of 500 plays. In each play the chosen CG is evaluated in 10 runs of 200 plays; the reward averaged over those 10 runs is the estimated reward for the chosen CG.
In Fig. 6 we show a CG that the agents learned. The temperature $\tau$ is decreased as $\tau = 1000 \times 0.994^{play}$. The results are averaged over 1000 runs.
Fig. 6. A Coordination Graph learned by the agents.
This shows that agents can determine which agents are more important to coordinate with, but we still have to explain how agents that learn the graph can perform better than agents given the same graph as the problem structure. Agents that do not coordinate directly are independent learners relative to each other. Such agents are able to find the optimal reward by climbing, that is, each agent in turn changes its own action (Guestrin, Lagoudakis, & Parr, 2002). The starting point is the joint action with the highest average reward, and if the global optimal reward can be reached by climbing from that point, then independent learning is enough to find the optimal reward.
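As an illustration of this climbing procedure, a minimal sketch (the joint-action scoring function is an assumed stand-in for the learned estimates):

```java
import java.util.function.ToDoubleFunction;

/** Sketch of coordinate-ascent "climbing": each agent in turn switches to
 *  its best single-agent improvement while the others stay fixed. */
public final class Climbing {

    static int[] climb(int[] joint, int actionsPerAgent,
                       ToDoubleFunction<int[]> rewardOf) {
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < joint.length; i++) {
                int old = joint[i];
                int bestA = old;
                double best = rewardOf.applyAsDouble(joint);
                for (int a = 0; a < actionsPerAgent; a++) {
                    joint[i] = a;                         // try each alternative
                    double r = rewardOf.applyAsDouble(joint);
                    if (r > best) { best = r; bestA = a; }
                }
                joint[i] = bestA;                         // keep the best response
                if (bestA != old) improved = true;
            }
        }
        return joint;
    }
}
```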
Conclusion
Given a CG, we implemented a distributed Q-learning algorithm in which the agents find the best actions to maximize the total reward. The only information each agent has is the actions taken by the agents it coordinates with, and the total reward of their joint action. For learning the CG itself we implemented a Q-learning algorithm in which agents learn the best coordination graph. In this case the agents act as independent learners: the only information they have is the total reward they obtain when playing with the current coordination graph.
References
Chapman, A. C., Micillo, R. A., Kota, R., & Jennings, N. R. (2009, May). Decentralised dynamic task allocation: a practical game-theoretic approach. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, Volume 2 (pp. 915-922). International Foundation for Autonomous Agents and Multiagent Systems.
Claus, C., & Boutilier, C. (1998, July). The dynamics of
reinforcement learning in cooperative multiagent systems. In
AAAI/IAAI (pp. 746-752).
Guestrin, C., Lagoudakis, M., & Parr, R. (2002, July).
Coordinated reinforcement learning. In ICML (Vol. 2, pp.
227-234).
Kho, J., Rogers, A., & Jennings, N. R. (2009). Decentralized
control of adaptive sampling in wireless sensor networks.
ACM Transactions on Sensor Networks (TOSN), 5(3), 19.
Van Leeuwen, P., Hesselink, H., & Rohling, J. (2002). Scheduling aircraft using constraint satisfaction. Electronic Notes in Theoretical Computer Science, 76, 252-268.
Sutton, R. S., & Barto, A. G. (1998). Introduction to
reinforcement learning. MIT Press.
Taylor, M. E., Jain, M., Tandon, P., Yokoo, M., & Tambe, M.
(2011). Distributed on-line multi-agent optimization under
uncertainty: Balancing exploration and exploitation. Advances
in Complex Systems, 14(03), 471-528.