Local Coordination in Online Distributed Constraint Optimization Problems

Antonio Maria Fiscarelli¹, Robert Vanden Eynde¹ and Erman Loci²

¹ Ecole Polytechnique, Universite Libre de Bruxelles, Avenue Franklin Roosevelt 50, 1050 Bruxelles
² Artificial Intelligence Lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium

afiscare@ulb.ac.be
Abstract

For agents to achieve a common goal in multi-agent systems, they often need to coordinate. One way to achieve coordination is to let agents learn in a joint action space. Joint Action Learning allows agents to take the actions of other agents into account, but as the number of agents grows, the joint action space grows exponentially. If coordination between some agents is more important than between others, then local coordination allows the agents to coordinate while keeping the complexity low. In this paper we investigate local coordination in which agents learn the problem structure, resulting in better group performance.
Introduction

In multi-agent systems, agents must coordinate to achieve a jointly optimal payoff. One way to achieve this coordination is to let agents observe the actions that other agents choose and, based on those actions, select an action that increases the total payoff. This method of learning is called Joint Action Learning (JAL). As the number of agents in JAL increases, the joint space in which the agents learn grows exponentially (Claus & Boutilier, 1998), because every agent has to observe the actions of every other agent. So even though JALs find optimal solutions, they quickly become too expensive to compute. In this paper we introduce a method, Local Joint Action Learning (LJAL), that addresses this complexity problem by sacrificing some solution quality: we let agents see the actions of only some of the other agents, only those that are important or necessary.
Local Joint Action Learners

The LJAL approach relies on the concept of a Coordination Graph (CG) (Guestrin, Lagoudakis, & Parr, 2002). A CG describes the action dependencies between agents: vertices represent agents, and edges represent coordination between those agents (Fig. 1).

Fig. 1. An example of a Coordination Graph (CG).
In LJAL the learning problem can be described as a
distributed n-armed bandit problem, where every agent
can choose among n actions, and the reward depends on
the combination of all chosen actions.
Agents estimate rewards according to the following
formula (Sutton & Barto, 1998):
$$Q_{t+1}(a) = Q_t(a) + \alpha \left[ r_{t+1} - Q_t(a) \right]$$
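As a small illustration, this is the standard incremental value estimate for an n-armed bandit. A minimal Python sketch, where the dictionary `q`, the step size `alpha`, and the key `joint_action` are our own naming assumptions (here `joint_action` would be the tuple of the agent's own action together with the actions of its CG neighbors):

```python
def update_q(q, joint_action, reward, alpha=0.1):
    """Q_{t+1}(a) = Q_t(a) + alpha * (r_{t+1} - Q_t(a)) for the observed joint action."""
    old = q.get(joint_action, 0.0)
    q[joint_action] = old + alpha * (reward - old)
```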
LJALs also keep a probabilistic model of the other agents' action selection: they count the number of times $C^j_{a_j}$ each action has been chosen by each agent. Agent $i$ maintains the frequency $F^i_{a_j}$ with which agent $j$ selects action $a_j$ from its action set $A_j$:

$$F^i_{a_j} = \frac{C^j_{a_j}}{\sum_{b_j \in A_j} C^j_{b_j}}$$
The expected value for agent $i$ of selecting action $a_i$ is calculated as follows:

$$EV(a_i) = \sum_{a \in A^i} Q(a \cup \{a_i\}) \prod_{j \in N(i)} F^i_{a[j]}$$

where $A^i = \times_{j \in N(i)} A_j$ and $N(i)$ is the set of neighbors of agent $i$ in the CG.
Following Sutton and Barto (1998), the probability that agent $i$ chooses action $a_i$ at time $t$ is:

$$\Pr(a_i) = \frac{e^{EV(a_i)/\tau}}{\sum_{b_i=1}^{n} e^{EV(b_i)/\tau}}$$

The temperature parameter $\tau$ controls how greedily actions are selected.
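Putting the frequency model, the expected value, and the Boltzmann selection together, a simplified sketch could look as follows. This is illustrative only; the data structures `counts`, `q`, and `neighbour_actions` are our assumptions, not code from the paper:

```python
import itertools
import math
import random

def frequencies(counts):
    """F^i_{a_j}: empirical frequency with which each neighbour j picks each of its actions."""
    freqs = {}
    for j, c in counts.items():                 # counts[j][a_j] = times agent j chose a_j
        total = sum(c.values())
        freqs[j] = {a: (n / total if total else 1.0 / len(c)) for a, n in c.items()}
    return freqs

def expected_values(q, freqs, own_actions, neighbour_actions):
    """EV(a_i) = sum over neighbour joint actions a of Q(a u {a_i}) * prod_j F^i_{a[j]}."""
    neighbours = sorted(neighbour_actions)       # fixed ordering of the CG neighbours
    ev = {}
    for a_i in own_actions:
        total = 0.0
        for combo in itertools.product(*(neighbour_actions[j] for j in neighbours)):
            prob = 1.0
            for j, a_j in zip(neighbours, combo):
                prob *= freqs[j][a_j]
            # Q is keyed by the agent's own action followed by its neighbours' actions
            total += q.get((a_i,) + combo, 0.0) * prob
        ev[a_i] = total
    return ev

def boltzmann_choice(ev, tau):
    """Pr(a_i) proportional to exp(EV(a_i) / tau)."""
    actions = list(ev)
    m = max(ev.values())                         # shift by the max for numerical stability
    weights = [math.exp((ev[a] - m) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```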
LJAL performance

We compare Independent Learners (IL) with LJALs using randomly generated CGs of out-degree 1, 2, and 3 per agent (LJAL-1, LJAL-2, and LJAL-3, respectively). These are evaluated on randomly generated distributed bandit problems: for every possible joint action, a fixed global reward is drawn from a normal distribution N(0, 70) (70 = 10 × number of agents). A single run of the experiment consists of 200 plays, in which 7 agents each choose among 4 actions and receive the reward for the global joint action as determined by the problem. Every run, the LJALs get a new random graph with the corresponding out-degree. Agents select their actions with temperature $\tau = 1000 \times 0.94^{play}$. The results are averaged over 200 runs (Fig. 2).
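The setup described above could be generated roughly as follows; this is our own reading of the experiment, and the function and parameter names are hypothetical:

```python
import random

N_AGENTS, N_ACTIONS, SIGMA = 7, 4, 70.0          # 7 agents, 4 actions, rewards ~ N(0, 70)

def random_cg(n_agents, out_degree):
    """Random directed CG: each agent observes `out_degree` other agents."""
    return {i: random.sample([j for j in range(n_agents) if j != i], out_degree)
            for i in range(n_agents)}

def random_bandit(sigma=SIGMA):
    """Distributed bandit: a fixed global reward per joint action, drawn from N(0, sigma)."""
    table = {}
    def reward(joint_action):
        if joint_action not in table:
            table[joint_action] = random.gauss(0.0, sigma)
        return table[joint_action]
    return reward

def temperature(play):
    """Exploration schedule used by the agents: tau = 1000 * 0.94^play."""
    return 1000 * 0.94 ** play
```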
Fig. 2. Comparison of IL and LJALs.
We can see that the solution quality of IL is the worst, and that the reward improves with more coordination. This is because ILs only reason about themselves, while LJALs take the actions of other agents into consideration. LJALs achieve better solution quality, but their complexity also increases with the amount of coordination.
Distributed Constraint Optimization

A Constraint Optimization Problem (COP) is the problem of assigning values to a set of variables, subject to a number of soft constraints. Solving a COP means maximizing the sum of the rewards that the constraints associate with the chosen variable assignment. A Distributed Constraint Optimization Problem (DCOP) is a tuple (A, X, D, C, f), where:

$A = \{a_1, a_2, \ldots, a_l\}$, the set of agents;
$X = \{x_1, x_2, \ldots, x_n\}$, the set of variables;
$D = \{D_1, D_2, \ldots, D_n\}$, the set of domains; variable $x_i$ can be assigned values from the finite domain $D_i$;
$C = \{c_1, c_2, \ldots, c_m\}$, the set of constraints; constraint $c_i$ is a function $D_a \times D_b \times \ldots \times D_k \to \mathbb{R}$, with $\{a, b, \ldots, k\} \subseteq \{1, 2, \ldots, n\}$, projecting the domains of a subset of variables onto a real number, the reward;
$f: X \to A$, a function mapping each variable onto a single agent.

The total reward of a variable assignment $S$, assigning value $v(x_i) \in D_i$ to variable $x_i$, is:

$$C(S) = \sum_{i=1}^{m} c_i\bigl(v(x_a), v(x_b), \ldots, v(x_k)\bigr)$$
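One way to encode a DCOP and the total reward $C(S)$ in code is sketched below; the class and field names are our own, not the paper's:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Constraint:
    scope: Tuple[int, ...]                 # indices of the variables c_i touches
    reward: Callable[..., float]           # D_a x D_b x ... -> R

@dataclass
class DCOP:
    domains: Dict[int, list]               # D_i for each variable x_i
    constraints: List[Constraint]          # c_1 ... c_m
    owner: Dict[int, int]                  # f: variable index -> agent index

def total_reward(dcop: DCOP, assignment: Dict[int, object]) -> float:
    """C(S) = sum_{i=1}^{m} c_i(v(x_a), ..., v(x_k)) for the assignment S."""
    return sum(c.reward(*(assignment[v] for v in c.scope)) for c in dcop.constraints)
```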
DCOPs are used to model a variety of real-world problems, ranging from disaster response scenarios (Chapman et al., 2009) and distributed sensor network management (Kho, Rogers, & Jennings, 2009), to traffic management in congested networks (van Leeuwen, Hesselink, & Rohling, 2002).
In a DCOP each constraint has its own reward function, and since the total reward for a solution is the sum of all rewards, some constraints can have a larger impact on the solution quality than others. Coordination between specific agents can therefore be more important than coordination between others. We investigate the performance of LJALs on DCOPs where some constraints are more important than others. We generate random, fully connected DCOPs, drawing the rewards of every constraint function from different normal distributions. We attach a weight $w_i \in [0, 1]$ to each constraint $c_i$; the problem's variance $\sigma$ is multiplied by this weight when the reward function for constraint $c_i$ is generated. The rewards for constraint $c_i$ are thus drawn from the distribution:

$$N(0, \sigma w_i)$$
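Generating such a weighted, fully connected problem could be sketched as follows; the tabular reward representation and the names `weighted_constraint`, `weights`, and `sigma` are illustrative assumptions:

```python
import itertools
import random

def weighted_constraint(domain_a, domain_b, weight, sigma=70.0):
    """Binary constraint whose rewards are drawn from N(0, sigma * weight)."""
    table = {pair: random.gauss(0.0, sigma * weight)
             for pair in itertools.product(domain_a, domain_b)}
    return lambda va, vb: table[(va, vb)]

def weighted_dcop(domains, weights):
    """Fully connected problem: weights[(i, j)] in [0, 1] for every pair of variables i < j."""
    return {(i, j): weighted_constraint(domains[i], domains[j], w)
            for (i, j), w in weights.items()}
```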
In our experiment we compare different LJALs solving the problem structure given in Fig. 3. The black edges in Fig. 3 correspond to weights of 0.9, while the gray edges correspond to weights of 0.1.
Fig. 3. A weighted CG; darker edges mean more important constraints, lighter edges mean less important constraints.
In addition to IL and a LJAL with a random CG of out-degree 2 (here called LJAL-1), we compare a LJAL whose CG matches the problem structure (LJAL-2), and another LJAL with the same structure as the problem but with an added edge between agents 1 and 5 (LJAL-3). From the results shown below (Fig. 4) we can see that LJAL-2 performs better than LJAL-1, meaning that a LJAL with a CG that corresponds to the problem structure gives better solutions than a LJAL with randomly generated CGs. We can also see that the added coordination between agents 1 and 5 in LJAL-3 does not improve the solution quality. This happens because the extra information about an unimportant constraint complicates the coordination on the important constraints. As Taylor et al. (2011) note, an increase in teamwork is not necessarily beneficial to solution quality.
Fig. 4. Comparison of IL and LJALs on a distributed constraint optimization problem (black: IL, red: LJAL-1, blue: LJAL-2, green: LJAL-3).
We run another experiment to test the effect the extra coordination edge has on solution quality. We modify LJAL-3 by adding an extra coordination edge between agents 4 and 7 and removing the edge between agents 1 and 5 (Fig. 5). We can see that the extra coordination between agents 4 and 7 improves the solution quality, because agents 4 and 7 are not involved in any important constraint.
Fig. 5. The effect of an extra coordination edge on solution quality
Learning Coordination Graphs

In the previous experiment we showed that LJALs with the same CG as the problem structure perform better than LJALs with randomly generated CGs. In the next experiment we let the LJALs learn the CG themselves.

The problem of learning a CG is encoded as a distributed n-armed bandit problem. Each agent can choose at most one or two coordination partners. We map the two-partner selection to an n-armed bandit problem by making actions represent pairs of agents instead of single agents. The coordination partners are chosen randomly; after they are chosen, the LJALs solve the learning problem using that graph, and the resulting reward is used as feedback for choosing the next coordination partners. This is one play at the meta-learning level. The process is repeated until the CG converges. The agents in this meta-bandit problem are independent learners.
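The meta-learning loop can be sketched as an outer bandit over coordination graphs. This is a simplified illustration: `run_ljal` (which runs the inner LJAL problem on a given graph and returns its average reward), `candidate_partners` (for each agent, the tuples of one or two other agents it may choose), and the parameter values are hypothetical stand-ins:

```python
import math
import random

def softmax_pick(values, tau):
    """Boltzmann selection over a dict of estimated values."""
    keys = list(values)
    m = max(values.values())
    weights = [math.exp((values[k] - m) / tau) for k in keys]
    return random.choices(keys, weights=weights)[0]

def learn_cg(agents, candidate_partners, run_ljal, meta_plays=500, tau0=1000.0, decay=0.994):
    """Meta-bandit: each agent independently learns which partner(s) to coordinate with."""
    q = {i: {p: 0.0 for p in candidate_partners[i]} for i in agents}   # meta Q-values
    n = {i: {p: 0 for p in candidate_partners[i]} for i in agents}     # selection counts
    for play in range(meta_plays):
        tau = tau0 * decay ** play
        choice = {i: softmax_pick(q[i], tau) for i in agents}          # pick partner tuples
        cg = {i: list(choice[i]) for i in agents}                      # chosen partners form the CG
        reward = run_ljal(cg)                    # e.g. mean reward of 10 inner runs of 200 plays
        for i in agents:                         # independent-learner (sample-average) update
            n[i][choice[i]] += 1
            q[i][choice[i]] += (reward - q[i][choice[i]]) / n[i][choice[i]]
    return {i: max(q[i], key=q[i].get) for i in agents}
```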
In our experiment we let the agents learn a CG for the problem of Fig. 3, so that we can compare the learned CG with the known problem structure. One meta-bandit run consists of 500 plays. In each play the chosen CG is evaluated in 10 runs of 200 plays, and the average reward over these 10 runs is the estimated reward for the chosen CG.

In Fig. 6 we show a CG that the agents learned. The temperature is decayed as $\tau = 1000 \times 0.994^{play}$. The results are averaged over 1000 runs.
Fig. 6. A coordination graph learned by the agents.
This shows that agents can determine which agents are more important to coordinate with. It remains to explain how agents that learn the graph can perform better than agents given the same graph as the problem structure. Agents that do not coordinate directly are independent learners relative to each other, and such agents can still find the optimal reward by climbing, that is, by each agent in turn changing its own action (Guestrin, Lagoudakis, & Parr, 2002). The starting point is the joint action with the highest average reward, and if a globally optimal reward can be reached by climbing from that point, then independent learning is enough to find it.
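The climbing idea can be illustrated with a toy best-response sweep (our own sketch, not the paper's algorithm): starting from some joint action, each agent in turn switches to the action that most improves the global reward, until no agent can improve it further.

```python
def climb(start, actions, global_reward):
    """Repeated best-response sweeps: each agent in turn changes only its own action."""
    joint = list(start)
    improved = True
    while improved:
        improved = False
        for i in range(len(joint)):
            best, best_r = joint[i], global_reward(tuple(joint))
            for a in actions[i]:               # try every action of agent i
                joint[i] = a
                r = global_reward(tuple(joint))
                if r > best_r:
                    best, best_r, improved = a, r, True
            joint[i] = best                    # keep agent i's best response
    return tuple(joint)
```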
Conclusion

Given a CG, we implement a distributed Q-learning algorithm in which the agents find the actions that maximize the total reward. The only information each agent has is the actions taken by the agents it coordinates with and the total reward of their joint action. For learning the CG itself, we implement a Q-learning algorithm in which agents learn the best coordination graph. In this case, since it is not distributed, the only information the agents have is the total reward they obtain when playing with the current coordination graph.
References
Chapman, A. C., Micillo, R. A., Kota, R., & Jennings, N. R. (2009, May). Decentralised dynamic task allocation: a practical game-theoretic approach. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2 (pp. 915-922). International Foundation for Autonomous Agents and Multiagent Systems.
Claus, C., & Boutilier, C. (1998, July). The dynamics of
reinforcement learning in cooperative multiagent systems. In
AAAI/IAAI (pp. 746-752).
Guestrin, C., Lagoudakis, M., & Parr, R. (2002, July).
Coordinated reinforcement learning. In ICML (Vol. 2, pp.
227-234).
Kho, J., Rogers, A., & Jennings, N. R. (2009). Decentralized
control of adaptive sampling in wireless sensor networks.
ACM Transactions on Sensor Networks (TOSN), 5(3), 19.
Van Leeuwen, P., Hesselink, H., & Rohling, J. (2002).
Scheduling aircraft using constraint satisfaction. Electronic
notes in theoretical computer science, 76, 252-268.
Sutton, R. S., & Barto, A. G. (1998). Introduction to
reinforcement learning. MIT Press.
Taylor, M. E., Jain, M., Tandon, P., Yokoo, M., & Tambe, M.
(2011). Distributed on-line multi-agent optimization under
uncertainty: Balancing exploration and exploitation. Advances
in Complex Systems, 14(03), 471-528.