Discrete Sequential Prediction of Continuous
Actions for Deep RL
Luke Metz∗, Julian Ibarz, James Davidson - Google Brain
Navdeep Jaitly - NVIDIA Research
presented by Jie-Han Chen
(under review as a conference paper at ICLR 2018)
My current challenge
The action space of pysc2 is complicated
Different types of actions need different parameters
Outline
● Introduction
● Method
● Experiments
● Discussion
Introduction
Two kinds of action space
● Discrete action space
● Continuous action space
Introduction - Continuous action space
[Figure: an action space illustrated with six continuous actions, action 1 – action 6]
Introduction - Continuous action space
Continuous action spaces can be handled well by policy-gradient-based
algorithms (NOT value-based ones).
[Figure: a policy network outputting continuous actions a1, a2, a3]
Introduction - Discrete action space
a1, Q(s, a1)
a2, Q(s, a2)
a3, Q(s, a3)
a4, Q(s, a4)
Introduction - Discretized continuous action
If we want to use a discrete-action
method to solve a continuous
action problem, we need to
discretize the continuous action
values.
Introduction - Discretized continuous action
If we split a continuous angle into 3.6° bins, then for a 1-D action
there are already 100 output neurons!
[Figure: the angle axis discretized at 0°, 3.6°, 7.2°, ...]
Introduction - Discretized continuous action
For a 2-D action, we need 10,000 neurons (100 × 100) to cover all
combinations!
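As a side illustration (my own sketch, not from the paper): counting the output neurons a naive joint discretization needs, versus handling one dimension at a time as SDQN will do.

# Illustrative only: 100 bins per dimension (the slide's 3.6° example).
bins_per_dim = 100

for action_dims in (1, 2, 3, 6, 17):          # 17-D matches the MuJoCo humanoid
    joint = bins_per_dim ** action_dims       # one output per joint combination
    sequential = bins_per_dim * action_dims   # one 100-way head per dimension
    print(action_dims, joint, sequential)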
Introduction
In this paper, they focus on:
● Off-policy algorithms
● Value-based methods (which usually cannot handle continuous action problems)
They want to extend DQN so that it can solve continuous action problems.
Method
● Inspired by sequence-to-sequence models
● They call this method SDQN ("S" stands for sequential)
● Output one action dimension at a time
○ reduces an N-D action selection to a series of 1-D action selections (see the sketch after this list)
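A minimal sketch of this sequential selection, assuming a hypothetical inner_q(i, state, prefix, b) helper that returns the i-th inner Q value for candidate bin b given the already-chosen dimensions (the name and signature are mine, not the paper's):

import numpy as np

def select_action(state, inner_q, n_dims, bins):
    # Choose the action one discretized dimension at a time, greedily,
    # conditioning each choice on the dimensions already selected.
    prefix = []
    for i in range(n_dims):
        q_values = [inner_q(i, state, prefix, b) for b in bins]
        prefix.append(bins[int(np.argmax(q_values))])
    return np.array(prefix)

# Dummy inner Q just to show the call pattern (it always prefers bins near 0.3):
bins = np.linspace(-1.0, 1.0, 100)
dummy_q = lambda i, state, prefix, b: -(b - 0.3) ** 2
print(select_action(np.zeros(4), dummy_q, n_dims=3, bins=bins))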
Method - seq2seq
[Figure: seq2seq-style chain of Q networks — Q1 takes St and outputs a1,
Q2 takes St + a1 and outputs a2, Q3 takes St + a1 + a2 and outputs a3]
Does it make sense?
Define agent-environment boundary
Before defining the set of states, we should define the
boundary between the agent and the environment.
According to Richard Sutton’s textbook:
1. “The agent-environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge.”
2. “The general rule we follow is that anything that
cannot be changed arbitrarily by the agent is
considered to be outside of it and thus part of its
environment.”
Method - seq2seq
[Figure: the same chain (St → a1, St + a1 → a2, St + a1 + a2 → a3, via Q1, Q2, Q3),
with each Q-step marked as an agent]
Method - transformed MDP
Original MDP -> Inner MDP + Outer MDP
[Figure: the inner steps take St, St + a1, and St + a1 + a2 as inputs, respectively]
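Roughly, the transformation can be summarized as follows (the notation is mine, for illustration):

s^{inner}_0 = s_t, \quad s^{inner}_i = (s_t, a^1, \dots, a^i) \ \text{for}\ i = 1, \dots, N-1

Only after the N-th dimension is chosen does the outer MDP advance: the full action a_t = (a^1, ..., a^N) is applied and the environment returns r_t and St+1.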
Method - Action Selection
Method - Training
There are 3 update stages:
● Outer MDP (St -> St+1)
● The inner Qs are updated by Q-Learning
● The last inner Q needs to match Q(St+1)
Method - Training
2 kinds of neural network (maybe not exactly 2 — it depends on the implementation)
● The outer Q network is denoted by
● The inner Q networks are denoted by
○ the i-th dimension's action value
○ the last inner Q is denoted by
Method - Training
Update the outer Q network
● The outer Q network is denoted by
○ used only to evaluate state-action values, not to select actions
○ updated via the Bellman equation
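A hedged sketch of this Bellman update, writing Q^O for the outer network (the symbols are mine; the elided equation on the slide is the paper's exact form):

L_{outer} = \big( Q^O(s_t, a_t) - ( r_t + \gamma \, Q^O(s_{t+1}, a^*) ) \big)^2

where a^* is the action selected, dimension by dimension, by the inner networks at s_{t+1}.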
Method - Training
Update the inner Q networks
● The inner Q network is denoted by
○ the i-th dimension's action value
○ updated by Q-Learning:
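Roughly, for i < N (illustrative notation, with Q^I_i for the i-th inner network):

Q^I_i(s_t, a^{1:i}) \leftarrow \max_{a^{i+1}} Q^I_{i+1}(s_t, a^{1:i}, a^{i+1})

i.e. a Bellman backup inside the inner MDP, which has no intermediate reward and no discounting between dimensions.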
Method - Training
Update the last inner Q network
● The inner Q network is denoted by
○ the last inner Q network is denoted by
○ updated by regression
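As I read the method, the regression target is the outer Q evaluated on the full action (same illustrative notation):

L_{last} = \big( Q^I_N(s_t, a^{1:N}) - Q^O(s_t, a_t) \big)^2, \qquad a_t = (a^1, \dots, a^N)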
Implementation of
1. A recurrent model with shared weights, using an LSTM
a. input: state + the previously selected action (NOT )
2. Multiple separate feedforward models
a. input: state + the concatenated already-selected actions
b. more stable than the recurrent variant (see the sketch below)
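A minimal sketch of the second, feedforward variant in PyTorch; the class name, layer sizes, and dimensions are hypothetical, not the paper's code:

import torch
import torch.nn as nn

class InnerQHead(nn.Module):
    # One separate feedforward inner-Q model for action dimension i.
    # Input: state concatenated with the i already-selected action values.
    # Output: one Q value per discretization bin of dimension i + 1.
    def __init__(self, state_dim, i, n_bins, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + i, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, state, chosen_prefix):
        return self.net(torch.cat([state, chosen_prefix], dim=-1))

# No weight sharing: one head per action dimension, e.g. for a 3-D action:
heads = [InnerQHead(state_dim=11, i=i, n_bins=100) for i in range(3)]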
Method - Exploration
Experiments
● Multimodal Example Environment
○ compared against other state-of-the-art models to test its effectiveness
○ DDPG: a state-of-the-art off-policy actor-critic algorithm
○ NAF: another value-based algorithm that can solve continuous action problems
● MuJoCo environments
○ Test SDQN on common continuous control tasks
○ 5 tasks
Experiments - Multimodal Example Environment
1. Single-step MDP
a. only 2 states: the initial state and the terminal state
2. Deterministic environment
a. fixed transitions
3. 2-D action space (2 continuous actions)
4. Multimodal reward function
a. used to test whether the algorithm converges to a local or the global optimum (a toy sketch follows)
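A toy sketch of an environment in this spirit (the reward surface below is a stand-in I made up, not the paper's):

import numpy as np

def multimodal_reward(action):
    # Single-step, deterministic toy task: a 2-D action in [-1, 1]^2 is scored by
    # a reward with one global peak and two local peaks, so an algorithm can end
    # up at a local or the global optimum.
    a = np.asarray(action, dtype=float)
    bumps = [(np.array([0.7, 0.7]), 1.0),    # global optimum
             (np.array([-0.6, 0.2]), 0.6),   # local optimum
             (np.array([0.1, -0.8]), 0.4)]   # local optimum
    return sum(h * np.exp(-np.sum((a - c) ** 2) / 0.05) for c, h in bumps)

print(multimodal_reward([0.7, 0.7]))  # near the global peak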
Experiments - Multimodal Example Environment
[Figure: the reward surface and the final learned policy]
Experiments - Multimodal Example Environment
Experiments - MuJoCo environments
● hopper (3-D action)
● swimmer (2-D)
● half cheetah (6-D)
● walker2D (6-D)
● humanoid (17-D)
Experiments - MuJoCo environments
● Perform a hyperparameter search and select the best configuration to
evaluate performance
● Run 10 random seeds for each environment
Experiments - MuJoCo environments
Training for 2M steps
The values are the average best performance over 10 random seeds
Recap DeepMind pysc2
The Network Architecture
Recap DeepMind pysc2
Discussion
