The first part of the talk covers our latest research on 1-million-agent reinforcement learning and its potential applications. Our findings show that the population dynamics of AI agents, driven by reinforcement learning and self-interest, follow patterns similar to those found in nature. In the second part of the talk I move to a reinforcement learning setting where the game environment is strategic and designable. We present a simple case of designing a difficult maze, but the techniques can be used in various applications where system-level objectives are inconsistent with the agents' goals. I conclude the talk by pointing out future directions in this exciting field of AI.
A Society of AI Agents
1. A Society of AI Agents
(A Society of Swarm Intelligence)
Jun Wang
UCL
CCF-GAIR 2017, Shenzhen, China
2. 1st-ranking UK university for research strength (Research Excellence Framework, 2014)
29 Nobel Prize winners (14 non-UK)
3 Fields Medal winners
1,107 professors (1/3 non-UK)
12,403 academic and professional services staff (4,875 non-UK)
Only UK university awarded both (institutional)
By 1878, the first English university to admit female students on equal terms with men
Our founding principles - academic excellence and research addressing real-world problems - continue today
University College London
5. MARL Application: Machine Bidding in Online Advertising
[Diagram: an advertiser with an ad budget interacts with the environment; it receives bid request x_t, responds with bid price a_t, observes the auction result and user response, and then receives the next bid request x_{t+1}.]
The goal is to maximise the user responses on displayed ads.
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
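To make the interaction loop above concrete, here is a minimal sketch of an episodic bidding agent. It illustrates the MDP framing only, not the paper's RLB algorithm: the discretised bid set, the learning constants, and the env.auction interface are all assumptions for illustration.

    import random
    from collections import defaultdict

    BIDS = [0, 1, 2, 3, 4, 5]          # candidate bid prices (illustrative units)
    ALPHA, GAMMA, EPS = 0.1, 1.0, 0.1  # learning rate, discount, exploration

    Q = defaultdict(float)             # Q[(state, bid)] -> estimated clicks to go

    def act(state):
        """Epsilon-greedy bid selection over the discrete bid set."""
        if random.random() < EPS:
            return random.choice(BIDS)
        return max(BIDS, key=lambda a: Q[(state, a)])

    def update(s, a, r, s_next, done):
        """One-step Q-learning update on an observed auction transition."""
        target = r if done else r + GAMMA * max(Q[(s_next, b)] for b in BIDS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def run_episode(env, rounds, budget):
        """env.auction(bid) is an assumed interface returning (win, cost, click)."""
        s = (rounds, budget)           # state: bid requests left, budget left
        while s[0] > 0 and s[1] > 0:
            a = act(s)
            win, cost, click = env.auction(a)   # auction result + user response
            s_next = (s[0] - 1, s[1] - cost if win else s[1])
            done = s_next[0] == 0 or s_next[1] <= 0
            update(s, a, float(click), s_next, done)
            s = s_next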
[Excerpt from the paper, left edge cropped in the slide:] "... the generalization ability of the neural network is satisfactory only in small scales. For relatively large scales, the generalization of RLB-NN is not reliable. (iii) Compared to RLB-NN, the 3 sophisticated algorithms RLB-NN-Seg, RLB-NN-MapD and RLB-NN-MapA are more robust and outperform Lin under every budget condition. They do not rely on the generalization ability of the approximation model, therefore their performance is more stable. The results clearly demonstrate that they are effective solutions for the large-scale problem. (iv) As for eCPC, all solutions except from Mcpc are very close, thus making the proposed RLB algorithms practically effective."
Online deployment and A/B test: the proposed RLB model is deployed and tested in a live environment provided by Vlion DSP, running on a cluster of HP ProLiant DL360p Gen8 servers. From the comparison, with the same cost RLB achieves a lower cost per click and thus more total clicks than Lin, which shows the effectiveness of RLB.
[Table: approximation performance of the neural network: iPinYou 0.998 (x10^-6), YOYI 1.263 (x10^-6); theta_avg: iPinYou 9.404 (x10^-4), YOYI 11.954 (x10^-4).]
[Figure 10: Online A/B testing results, comparing Lin vs RLB on bids (x10^6), impressions (x10^5), total clicks, CTR (%), CPM (CNY) and eCPC (CNY).]
[Figure 11: Total clicks against episodes for Lin and RLB.]
6. MARL Application: AI plays StarCraft
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games, 2017.
8. Multi-agent Reinforcement Learning
[Diagram: several AI agents interact with a shared environment; at every timestep each agent takes action a_t and receives reward r_{t+1} and the next state S_{t+1}.]
Problem 1: current research is limited to fewer than 20 agents.
14. Artificial Population: Large-scale predator-prey world
[Figure: snapshots of the predator-prey grid world at timestep t and timestep t+1; legend: predator, prey, obstacle, health, ID, group 1, group 2.]
The setting:
• Predators hunt the prey in order to survive starvation.
• Each predator has its own health bar and limited eyesight (field of view).
• Predators can form a group to hunt the prey.
• Predators are scaled up to 1 million.
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
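As a hedged illustration of these rules (the constants and the reward-sharing scheme are assumptions, not the paper's settings), the per-predator bookkeeping might look like:

    STARVE_COST = 0.01   # health lost per timestep (starvation pressure)
    PREY_REWARD = 1.0    # health gained from a successful hunt

    class Predator:
        def __init__(self, pid):
            self.pid = pid        # identity, fed to the shared Q-network later
            self.health = 1.0     # each predator has its own health bar
            self.group = None     # predators can join or leave hunting groups

        def step(self, caught_prey, group_size=1):
            """Advance one timestep; returns False when the predator starves."""
            self.health -= STARVE_COST
            if caught_prey:
                self.health += PREY_REWARD / group_size  # a group shares the kill
            return self.health > 0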
15. Reinforcement Learning with 1 million agents
[Figure 2: Million-agent Q-learning in the predator-prey world. All agents share a single Q-network: each agent's (observation, ID) pair passes through an ID embedding into the network to produce Q-values; the agent acts and receives a reward; and every agent's transition (s_t, a_t, r_t, s_{t+1}) is pushed into a shared experience buffer, whose samples drive the Q-network updates.]
[Excerpt, cropped in the slide:] "... [colla]borating with others. We keep alternating the environments by feeding these two one after another (axiom 6) and examine the dynamics of grouping behaviors. ..."
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
[Excerpt from the paper, edges cropped in the slide:]
$\eta(\pi) = \mathbb{E}_\tau\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\big]$, where $\tau = (s_0, a_0, r_0, s_1, \ldots)$ denotes the trajectory sampled under the agent's policy $a_t \sim \pi_\theta(a_t \mid s_t)$, the initial state distribution $s_0 \sim \rho_0(s_0)$, and the state transition function $T(s_{t+1} \mid s_t, a_t)$. For each agent, the goal is to learn $\theta^* := \arg\max_\theta \eta(\pi)$.
The Setting of Many-agent Q-Learning
In the multi-agent setting, agents share a common Q-value function approximated by a deep network, $Q_t(s^i_t, a^i_t) = Q_t((O^i_t, v^i), a^i_t)$; we refer to [21] for the details of DQN. In this game it is reasonable to let the whole population share the same network since, biologically speaking, individuals of a species tend to inherit the same characteristics from their ancestors [32, 33, 34]; we therefore assume that the intelligence level of each predator is the same. However, note that apart from the different observations for each agent, the Q-network also takes a real-valued identity vector $v^i$, which incorporates individual uniqueness into the decision-making process. For exploration in the action space, we apply $\epsilon$-greedy action selection over $Q^i_t(s^i_t, a^i_t)$. At each timestep, every agent contributes its experienced transition $(s^i_t, a^i_t, r^i_t, s^i_{t+1})$ to the buffer, as shown in Fig. 2. Based on experience sampled from the buffer, the shared Q-network is updated as:
$Q(s^i_t, a^i_t) \leftarrow Q(s^i_t, a^i_t) + \alpha \big[ r^i_t + \gamma \max_{a' \in A} Q(s^i_{t+1}, a') - Q(s^i_t, a^i_t) \big].$
Experiments and Findings
The action space A: {move forward, backward, left, right, rotate left, rotate right, stand still, join a group, and leave a group}.
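A minimal PyTorch sketch of the shared Q-network described in the excerpt above: one network for the whole population, an agent-ID embedding alongside the observation, epsilon-greedy action selection, and the one-step Q-learning target. Network sizes, hyperparameters, and tensor shapes are illustrative assumptions, not the paper's.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    OBS_DIM, ID_DIM, N_ACTIONS = 64, 8, 9   # 9 actions, as listed above
    GAMMA, EPS = 0.99, 0.1

    class SharedQNet(nn.Module):
        def __init__(self, n_agents):
            super().__init__()
            self.id_emb = nn.Embedding(n_agents, ID_DIM)   # real-valued identity vector v^i
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM + ID_DIM, 128), nn.ReLU(),
                nn.Linear(128, N_ACTIONS))

        def forward(self, obs, agent_id):
            """agent_id is a long tensor; its embedding is concatenated to the observation."""
            x = torch.cat([obs, self.id_emb(agent_id)], dim=-1)
            return self.net(x)                              # Q(s^i_t, .) for each agent

    buffer = deque(maxlen=1_000_000)  # shared experience buffer of (s, a, r, s') transitions

    def act(qnet, obs, agent_id):
        """Epsilon-greedy action selection, as in the excerpt."""
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        with torch.no_grad():
            return qnet(obs, agent_id).argmax(-1).item()

    def td_update(qnet, opt, batch):
        """One gradient step toward the target r + gamma * max_a' Q(s', a')."""
        obs, ids, acts, rews, obs2 = batch
        q = qnet(obs, ids).gather(-1, acts.unsqueeze(-1)).squeeze(-1)
        with torch.no_grad():
            target = rews + GAMMA * qnet(obs2, ids).max(-1).values
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); loss.backward(); opt.step()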
16. What happens if we disable the predators' learning ability?
• Predators fail to adapt to the new environment
• The artificial ecosystem collapses quickly
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
19. Multi-agent Reinforcement Learning
[Diagram, repeated from slide 8: several AI agents interact with a shared environment, each taking action a_t and receiving reward r_{t+1} and the next state S_{t+1}.]
Problem 1: current research is limited to fewer than 20 agents.
Problem 2: the environment is assumed to be given and not designable.
20. Learning to design shopping space
A. Penn. The complexity of the elementary interface: shopping space. In Proceedings of the 5th International Space Syntax Symposium, volume 1, pages 25-42. Akkelies van Nes, 2005.
https://www.youtube.com/watch?v=NkePRXxH9D4
IKEA: designing the shopping space to encourage impulse purchases and longer stays.
21. Learning to design shopping space
A. Penn. The complexity of the elementary interface: shopping space. In Proceedings of the 5th International Space Syntax Symposium, volume 1, pages 25-42. Akkelies van Nes, 2005.
https://www.youtube.com/watch?v=NkePRXxH9D4
23. Learning to Design Environments (Games)
• We consider an environment that is controllable and strategic.
• A minimax game between the agent and the environment.
Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Wenxin Li. Learning to Design Games: Strategic Environments in Deep Reinforcement Learning, 2017.
[Figure 1: Framework dealing with non-differentiable transitions. 1. The environment generator generates environments G_1 ... G_6 (MDPs M with parameters theta_1 ... theta_6). 2. Each environment trains an agent A. 3. The agents operate in their respective environments with policies pi_phi1 ... pi_phi6. 4. The agents' returns guide the generator update. Caption: the generator generates environments with parameter theta; for each theta, agents are trained until optimal policies are obtained; then agents are tested in the corresponding environments and the observed returns guide the generator to update.]
[Excerpt from the paper, edges cropped in the slide:]
Solution for Undifferentiable Transition
Although we have proved the equivalence between the transition optimization and the policy optimization ... In this paper, we consider a particular objective of the MDP: the MDP acts as an adversarial environment minimizing the expected return of the agent, i.e. $O(H) = \sum_{t=1}^{\infty} \gamma^t \ldots$ Thus, the objective function is formulated as:
$\theta^* = \arg\min_\theta \max_\pi \mathbb{E}\big[G \mid \pi; M_\theta = \langle S, A, P_\theta, R, \gamma \rangle\big].$
This adversarial objective can be applied to design environments that expose the weaknesses of an agent and its policy-learning algorithms.
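A hedged sketch of how this minimax objective can be optimised when the environment transition is not differentiable, following the four steps of Figure 1. The generator.sample/generator.update, train_agent, and evaluate interfaces are assumptions for illustration; the paper's actual generator update may differ.

    def design_loop(generator, train_agent, evaluate, n_iters=100, n_envs=6):
        """Adversarial environment design: theta* = argmin_theta max_pi E[G | pi; M_theta]."""
        for _ in range(n_iters):
            thetas = [generator.sample() for _ in range(n_envs)]  # 1. generate environments
            scored = []
            for theta in thetas:
                policy = train_agent(theta)       # 2. train an agent in each environment
                ret = evaluate(policy, theta)     # 3./4. run the trained agent, observe return
                scored.append((theta, ret))
            generator.update(scored)              # low-return environments guide the generator
        return min(scored, key=lambda tr: tr[1])[0]  # hardest environment in the final batch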