The first part of the talk covers our latest research on 1-million-agent reinforcement learning and its potential applications. Our findings show that the population dynamics of AI agents, driven by reinforcement learning and self-interest, follow patterns similar to those found in nature. In the second part of the talk I move to a reinforcement learning setting where the game environment is strategic and designable. We present a simple case of designing a difficult maze, but the techniques can be used in various applications where system-level objectives are inconsistent with the agents' goals. I conclude the talk by pointing out future directions in this exciting field of AI.
A Society of AI Agents
1. A Society of AI Agents
(A Society of Swarm Intelligence)
Jun Wang
UCL
CCF-GAIR 2017, Shenzhen, China
2. 1st-ranking UK university for research strength (Research Excellence Framework, 2014)
29 Nobel Prize winners (14 non-UK)
3 Fields Medal winners
1,107 professors (1/3 non-UK)
12,403 academic and professional services staff (4,875 non-UK)
Only UK university awarded both (institutional)
By 1878, the first English university to admit female students on equal terms with men
Our founding principles - academic excellence and research addressing real-world problems - continue today
University College London
5. MARL Application: Machine Bidding in Online Advertising
[Diagram: an advertiser with an ad budget interacts with the environment; it receives bid request x_t, responds with bid price a_t, observes the auction result and user response, and then receives the next bid request x_{t+1}.]
The goal is to maximise the user responses on displayed ads.
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
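To make the interaction loop above concrete, here is a minimal sketch of an episodic bidding agent. It illustrates the MDP framing only, not the paper's RLB algorithm: the discretised bid set, the learning constants, and the env.auction interface are all assumptions for illustration.

    import random
    from collections import defaultdict

    BIDS = [0, 1, 2, 3, 4, 5]          # candidate bid prices (illustrative units)
    ALPHA, GAMMA, EPS = 0.1, 1.0, 0.1  # learning rate, discount, exploration

    Q = defaultdict(float)             # Q[(state, bid)] -> estimated clicks to go

    def act(state):
        """Epsilon-greedy bid selection over the discrete bid set."""
        if random.random() < EPS:
            return random.choice(BIDS)
        return max(BIDS, key=lambda a: Q[(state, a)])

    def update(s, a, r, s_next, done):
        """One-step Q-learning update on an observed auction transition."""
        target = r if done else r + GAMMA * max(Q[(s_next, b)] for b in BIDS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def run_episode(env, rounds, budget):
        """env.auction(bid) is an assumed interface returning (win, cost, click)."""
        s = (rounds, budget)           # state: bid requests left, budget left
        while s[0] > 0 and s[1] > 0:
            a = act(s)
            win, cost, click = env.auction(a)   # auction result + user response
            s_next = (s[0] - 1, s[1] - cost if win else s[1])
            done = s_next[0] == 0 or s_next[1] <= 0
            update(s, a, float(click), s_next, done)
            s = s_next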
[Excerpt from the paper, left edge cropped in the slide:] "... the generalization ability of the neural network is satisfactory only in small scales. For relatively large scales, the generalization of RLB-NN is not reliable. (iii) Compared to RLB-NN, the 3 sophisticated algorithms RLB-NN-Seg, RLB-NN-MapD and RLB-NN-MapA are more robust and outperform Lin under every budget condition. They do not rely on the generalization ability of the approximation model, therefore their performance is more stable. The results clearly demonstrate that they are effective solutions for the large-scale problem. (iv) As for eCPC, all solutions except from Mcpc are very close, thus making the proposed RLB algorithms practically effective."
Online deployment and A/B test: the proposed RLB model is deployed and tested in a live environment provided by Vlion DSP, running on a cluster of HP ProLiant DL360p Gen8 servers. From the comparison, with the same cost RLB achieves a lower cost per click and thus more total clicks than Lin, which shows the effectiveness of RLB.
[Table: approximation performance of the neural network: iPinYou 0.998 (x10^-6), YOYI 1.263 (x10^-6); theta_avg: iPinYou 9.404 (x10^-4), YOYI 11.954 (x10^-4).]
[Figure 10: Online A/B testing results, comparing Lin vs RLB on bids (x10^6), impressions (x10^5), total clicks, CTR (%), CPM (CNY) and eCPC (CNY).]
[Figure 11: Total clicks against episodes for Lin and RLB.]
6. MARL Application: AI plays StarCraft
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games, 2017.
8. Multi-agent Reinforcement Learning
[Diagram: several AI agents interact with a shared environment; at every timestep each agent takes action a_t and receives reward r_{t+1} and the next state S_{t+1}.]
Problem 1: current research is limited to fewer than 20 agents.
14. Artificial Population: Large-scale predator-prey world
[Figure: snapshots of the predator-prey grid world at timestep t and timestep t+1; legend: predator, prey, obstacle, health, ID, group 1, group 2.]
The setting:
• Predators hunt the prey in order to survive starvation.
• Each predator has its own health bar and limited eyesight (field of view).
• Predators can form a group to hunt the prey.
• Predators are scaled up to 1 million.
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
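As a hedged illustration of these rules (the constants and the reward-sharing scheme are assumptions, not the paper's settings), the per-predator bookkeeping might look like:

    STARVE_COST = 0.01   # health lost per timestep (starvation pressure)
    PREY_REWARD = 1.0    # health gained from a successful hunt

    class Predator:
        def __init__(self, pid):
            self.pid = pid        # identity, fed to the shared Q-network later
            self.health = 1.0     # each predator has its own health bar
            self.group = None     # predators can join or leave hunting groups

        def step(self, caught_prey, group_size=1):
            """Advance one timestep; returns False when the predator starves."""
            self.health -= STARVE_COST
            if caught_prey:
                self.health += PREY_REWARD / group_size  # a group shares the kill
            return self.health > 0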
15. Reinforcement Learning with 1 million agents
[Figure 2: Million-agent Q-learning in the predator-prey world. All agents share a single Q-network: each agent's (observation, ID) pair passes through an ID embedding into the network to produce Q-values; the agent acts and receives a reward; and every agent's transition (s_t, a_t, r_t, s_{t+1}) is pushed into a shared experience buffer, whose samples drive the Q-network updates.]
[Excerpt, cropped in the slide:] "... [colla]borating with others. We keep alternating the environments by feeding these two one after another (axiom 6) and examine the dynamics of grouping behaviors. ..."
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
[Excerpt from the paper, edges cropped in the slide:]
$\eta(\pi) = \mathbb{E}_\tau\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\big]$, where $\tau = (s_0, a_0, r_0, s_1, \ldots)$ denotes the trajectory sampled under the agent's policy $a_t \sim \pi_\theta(a_t \mid s_t)$, the initial state distribution $s_0 \sim \rho_0(s_0)$, and the state transition function $T(s_{t+1} \mid s_t, a_t)$. For each agent, the goal is to learn $\theta^* := \arg\max_\theta \eta(\pi)$.
The Setting of Many-agent Q-Learning
In the multi-agent setting, agents share a common Q-value function approximated by a deep network, $Q_t(s^i_t, a^i_t) = Q_t((O^i_t, v^i), a^i_t)$; we refer to [21] for the details of DQN. In this game it is reasonable to let the whole population share the same network since, biologically speaking, individuals of a species tend to inherit the same characteristics from their ancestors [32, 33, 34]; we therefore assume that the intelligence level of each predator is the same. However, note that apart from the different observations for each agent, the Q-network also takes a real-valued identity vector $v^i$, which incorporates individual uniqueness into the decision-making process. For exploration in the action space, we apply $\epsilon$-greedy action selection over $Q^i_t(s^i_t, a^i_t)$. At each timestep, every agent contributes its experienced transition $(s^i_t, a^i_t, r^i_t, s^i_{t+1})$ to the buffer, as shown in Fig. 2. Based on experience sampled from the buffer, the shared Q-network is updated as:
$Q(s^i_t, a^i_t) \leftarrow Q(s^i_t, a^i_t) + \alpha \big[ r^i_t + \gamma \max_{a' \in A} Q(s^i_{t+1}, a') - Q(s^i_t, a^i_t) \big].$
Experiments and Findings
The action space A: {move forward, backward, left, right, rotate left, rotate right, stand still, join a group, and leave a group}.
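A minimal PyTorch sketch of the shared Q-network described in the excerpt above: one network for the whole population, an agent-ID embedding alongside the observation, epsilon-greedy action selection, and the one-step Q-learning target. Network sizes, hyperparameters, and tensor shapes are illustrative assumptions, not the paper's.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    OBS_DIM, ID_DIM, N_ACTIONS = 64, 8, 9   # 9 actions, as listed above
    GAMMA, EPS = 0.99, 0.1

    class SharedQNet(nn.Module):
        def __init__(self, n_agents):
            super().__init__()
            self.id_emb = nn.Embedding(n_agents, ID_DIM)   # real-valued identity vector v^i
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM + ID_DIM, 128), nn.ReLU(),
                nn.Linear(128, N_ACTIONS))

        def forward(self, obs, agent_id):
            """agent_id is a long tensor; its embedding is concatenated to the observation."""
            x = torch.cat([obs, self.id_emb(agent_id)], dim=-1)
            return self.net(x)                              # Q(s^i_t, .) for each agent

    buffer = deque(maxlen=1_000_000)  # shared experience buffer of (s, a, r, s') transitions

    def act(qnet, obs, agent_id):
        """Epsilon-greedy action selection, as in the excerpt."""
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        with torch.no_grad():
            return qnet(obs, agent_id).argmax(-1).item()

    def td_update(qnet, opt, batch):
        """One gradient step toward the target r + gamma * max_a' Q(s', a')."""
        obs, ids, acts, rews, obs2 = batch
        q = qnet(obs, ids).gather(-1, acts.unsqueeze(-1)).squeeze(-1)
        with torch.no_grad():
            target = rews + GAMMA * qnet(obs2, ids).max(-1).values
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); loss.backward(); opt.step()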
16. What happens if we disable the predators' learning ability?
• Predators fail to adapt to the new environment
• The artificial ecosystem collapses quickly
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
19. Multi-agent Reinforcement Learning
[Diagram, repeated from slide 8: several AI agents interact with a shared environment, each taking action a_t and receiving reward r_{t+1} and the next state S_{t+1}.]
Problem 1: current research is limited to fewer than 20 agents.
Problem 2: the environment is assumed to be given and not designable.
20. Learning to design shopping space
A. Penn. The complexity of the elementary interface: shopping space. In Proceedings of the 5th International Space Syntax Symposium, volume 1, pages 25-42. Akkelies van Nes, 2005.
https://www.youtube.com/watch?v=NkePRXxH9D4
IKEA: designing the shopping space to encourage impulse purchases and longer stays.
21. Learning to design shopping space
A. Penn. The complexity of the elementary interface: shopping space. In Proceedings of the 5th International Space Syntax Symposium, volume 1, pages 25-42. Akkelies van Nes, 2005.
https://www.youtube.com/watch?v=NkePRXxH9D4
23. Learning to Design Environments (Games)
• We consider an environment that is controllable and strategic.
• A minimax game between the agent and the environment.
Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Wenxin Li. Learning to Design Games: Strategic Environments in Deep Reinforcement Learning, 2017.
[Figure 1: Framework dealing with non-differentiable transitions. 1. The environment generator generates environments G_1 ... G_6 (MDPs M with parameters theta_1 ... theta_6). 2. Each environment trains an agent A. 3. The agents operate in their respective environments with policies pi_phi1 ... pi_phi6. 4. The agents' returns guide the generator update. Caption: the generator generates environments with parameter theta; for each theta, agents are trained until optimal policies are obtained; then agents are tested in the corresponding environments and the observed returns guide the generator to update.]
[Excerpt from the paper, edges cropped in the slide:]
Solution for Undifferentiable Transition
Although we have proved the equivalence between the transition optimization and the policy optimization ... In this paper, we consider a particular objective of the MDP: the MDP acts as an adversarial environment minimizing the expected return of the agent, i.e. $O(H) = \sum_{t=1}^{\infty} \gamma^t \ldots$ Thus, the objective function is formulated as:
$\theta^* = \arg\min_\theta \max_\pi \mathbb{E}\big[G \mid \pi; M_\theta = \langle S, A, P_\theta, R, \gamma \rangle\big].$
This adversarial objective can be applied to design environments that expose the weaknesses of an agent and its policy-learning algorithms.
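A hedged sketch of how this minimax objective can be optimised when the environment transition is not differentiable, following the four steps of Figure 1. The generator.sample/generator.update, train_agent, and evaluate interfaces are assumptions for illustration; the paper's actual generator update may differ.

    def design_loop(generator, train_agent, evaluate, n_iters=100, n_envs=6):
        """Adversarial environment design: theta* = argmin_theta max_pi E[G | pi; M_theta]."""
        for _ in range(n_iters):
            thetas = [generator.sample() for _ in range(n_envs)]  # 1. generate environments
            scored = []
            for theta in thetas:
                policy = train_agent(theta)       # 2. train an agent in each environment
                ret = evaluate(policy, theta)     # 3./4. run the trained agent, observe return
                scored.append((theta, ret))
            generator.update(scored)              # low-return environments guide the generator
        return min(scored, key=lambda tr: tr[1])[0]  # hardest environment in the final batch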