Discrete Sequential Prediction of Continuous
Actions for Deep RL
Luke Metz∗, Julian Ibarz, James Davidson - Google Brain
Navdeep Jaitly - NVIDIA Research
presented by Jie-Han Chen
(under review as a conference paper at ICLR 2018)
My current challenge
The action space of pysc2 is complicated
Different types of actions need different parameters
Outline
● Introduction
● Method
● Experiments
● Discussion
Introduction
Two kinds of action space
● Discrete action space
● Continuous action space
Introduction - Continuous action space
[Figure: an action space illustrated with six continuous actions, action 1 – action 6]
Introduction - Continuous action space
Continuous action spaces can be handled well by policy-gradient-based
algorithms (NOT value-based ones).
[Figure: a policy network outputting continuous actions a1, a2, a3]
Introduction - Discrete action space
a1, Q(s, a1)
a2, Q(s, a2)
a3, Q(s, a3)
a4, Q(s, a4)
Introduction - Discretized continuous action
If we want to use a discrete-action
method to solve a continuous
action problem, we need to
discretize the continuous action
values.
Introduction - Discretized continuous action
If we split a continuous angle into 3.6° bins, then for a 1-D action
there are already 100 output neurons!
[Figure: the angle axis discretized at 0°, 3.6°, 7.2°, ...]
Introduction - Discretized continuous action
For a 2-D action, we need 10,000 neurons (100 × 100) to cover all
combinations!
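As a side illustration (my own sketch, not from the paper): counting the output neurons a naive joint discretization needs, versus handling one dimension at a time as SDQN will do.

# Illustrative only: 100 bins per dimension (the slide's 3.6° example).
bins_per_dim = 100

for action_dims in (1, 2, 3, 6, 17):          # 17-D matches the MuJoCo humanoid
    joint = bins_per_dim ** action_dims       # one output per joint combination
    sequential = bins_per_dim * action_dims   # one 100-way head per dimension
    print(action_dims, joint, sequential)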
Introduction
In this paper, they focus on:
● Off-policy algorithms
● Value-based methods (which usually cannot handle continuous action problems)
They want to extend DQN so that it can solve continuous action problems.
Method
● Inspired by sequence-to-sequence models
● They call this method SDQN ("S" stands for sequential)
● Output one action dimension at a time
○ reduces an N-D action selection to a series of 1-D action selections (see the sketch after this list)
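A minimal sketch of this sequential selection, assuming a hypothetical inner_q(i, state, prefix, b) helper that returns the i-th inner Q value for candidate bin b given the already-chosen dimensions (the name and signature are mine, not the paper's):

import numpy as np

def select_action(state, inner_q, n_dims, bins):
    # Choose the action one discretized dimension at a time, greedily,
    # conditioning each choice on the dimensions already selected.
    prefix = []
    for i in range(n_dims):
        q_values = [inner_q(i, state, prefix, b) for b in bins]
        prefix.append(bins[int(np.argmax(q_values))])
    return np.array(prefix)

# Dummy inner Q just to show the call pattern (it always prefers bins near 0.3):
bins = np.linspace(-1.0, 1.0, 100)
dummy_q = lambda i, state, prefix, b: -(b - 0.3) ** 2
print(select_action(np.zeros(4), dummy_q, n_dims=3, bins=bins))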
Method - seq2seq
[Figure: seq2seq-style chain of Q networks — Q1 takes St and outputs a1,
Q2 takes St + a1 and outputs a2, Q3 takes St + a1 + a2 and outputs a3]
Does it make sense?
Define agent-environment boundary
Before defining the set of states, we should define the
boundary between the agent and the environment.
According to Richard Sutton’s textbook:
1. “The agent-environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge.”
2. “The general rule we follow is that anything that
cannot be changed arbitrarily by the agent is
considered to be outside of it and thus part of its
environment.”
Method - seq2seq
[Figure: the same chain (St → a1, St + a1 → a2, St + a1 + a2 → a3, via Q1, Q2, Q3),
with each Q-step marked as an agent]
Method - transformed MDP
Original MDP -> Inner MDP + Outer MDP
[Figure: the inner steps take St, St + a1, and St + a1 + a2 as inputs, respectively]
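Roughly, the transformation can be summarized as follows (the notation is mine, for illustration):

s^{inner}_0 = s_t, \quad s^{inner}_i = (s_t, a^1, \dots, a^i) \ \text{for}\ i = 1, \dots, N-1

Only after the N-th dimension is chosen does the outer MDP advance: the full action a_t = (a^1, ..., a^N) is applied and the environment returns r_t and St+1.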
Method - Action Selection
Method - Training
There are 3 update stages:
● Outer MDP (St -> St+1)
● The inner Qs are updated by Q-Learning
● The last inner Q needs to match Q(St+1)
Method - Training
2 kinds of neural network (maybe not exactly 2 — it depends on the implementation)
● The outer Q network is denoted by
● The inner Q networks are denoted by
○ the i-th dimension's action value
○ the last inner Q is denoted by
Method - Training
Update the outer Q network
● The outer Q network is denoted by
○ used only to evaluate state-action values, not to select actions
○ updated via the Bellman equation
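A hedged sketch of this Bellman update, writing Q^O for the outer network (the symbols are mine; the elided equation on the slide is the paper's exact form):

L_{outer} = \big( Q^O(s_t, a_t) - ( r_t + \gamma \, Q^O(s_{t+1}, a^*) ) \big)^2

where a^* is the action selected, dimension by dimension, by the inner networks at s_{t+1}.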
Method - Training
Update the inner Q networks
● The inner Q network is denoted by
○ the i-th dimension's action value
○ updated by Q-Learning:
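Roughly, for i < N (illustrative notation, with Q^I_i for the i-th inner network):

Q^I_i(s_t, a^{1:i}) \leftarrow \max_{a^{i+1}} Q^I_{i+1}(s_t, a^{1:i}, a^{i+1})

i.e. a Bellman backup inside the inner MDP, which has no intermediate reward and no discounting between dimensions.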
Method - Training
Update the last inner Q network
● The inner Q network is denoted by
○ the last inner Q network is denoted by
○ updated by regression
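As I read the method, the regression target is the outer Q evaluated on the full action (same illustrative notation):

L_{last} = \big( Q^I_N(s_t, a^{1:N}) - Q^O(s_t, a_t) \big)^2, \qquad a_t = (a^1, \dots, a^N)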
Implementation of
1. A recurrent model with shared weights, using an LSTM
a. input: state + the previously selected action (NOT )
2. Multiple separate feedforward models
a. input: state + the concatenated already-selected actions
b. more stable than the recurrent variant (see the sketch below)
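A minimal sketch of the second, feedforward variant in PyTorch; the class name, layer sizes, and dimensions are hypothetical, not the paper's code:

import torch
import torch.nn as nn

class InnerQHead(nn.Module):
    # One separate feedforward inner-Q model for action dimension i.
    # Input: state concatenated with the i already-selected action values.
    # Output: one Q value per discretization bin of dimension i + 1.
    def __init__(self, state_dim, i, n_bins, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + i, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, state, chosen_prefix):
        return self.net(torch.cat([state, chosen_prefix], dim=-1))

# No weight sharing: one head per action dimension, e.g. for a 3-D action:
heads = [InnerQHead(state_dim=11, i=i, n_bins=100) for i in range(3)]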
Method - Exploration
Experiments
● Multimodal Example Environment
○ compared against other state-of-the-art models to test its effectiveness
○ DDPG: a state-of-the-art off-policy actor-critic algorithm
○ NAF: another value-based algorithm that can solve continuous action problems
● MuJoCo environments
○ Test SDQN on common continuous control tasks
○ 5 tasks
Experiments - Multimodal Example Environment
1. Single-step MDP
a. only 2 states: the initial state and the terminal state
2. Deterministic environment
a. fixed transitions
3. 2-D action space (2 continuous actions)
4. Multimodal reward function
a. used to test whether the algorithm converges to a local or the global optimum (a toy sketch follows)
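A toy sketch of an environment in this spirit (the reward surface below is a stand-in I made up, not the paper's):

import numpy as np

def multimodal_reward(action):
    # Single-step, deterministic toy task: a 2-D action in [-1, 1]^2 is scored by
    # a reward with one global peak and two local peaks, so an algorithm can end
    # up at a local or the global optimum.
    a = np.asarray(action, dtype=float)
    bumps = [(np.array([0.7, 0.7]), 1.0),    # global optimum
             (np.array([-0.6, 0.2]), 0.6),   # local optimum
             (np.array([0.1, -0.8]), 0.4)]   # local optimum
    return sum(h * np.exp(-np.sum((a - c) ** 2) / 0.05) for c, h in bumps)

print(multimodal_reward([0.7, 0.7]))  # near the global peak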
Experiments - Multimodal Example Environment
[Figure: the reward surface and the final learned policy]
Experiments - Multimodal Example Environment
Experiments - MuJoCo environments
● hopper (3-D action)
● swimmer (2-D)
● half cheetah (6-D)
● walker2D (6-D)
● humanoid (17-D)
Experiments - MuJoCo environments
● Perform a hyperparameter search and select the best configuration to
evaluate performance
● Run 10 random seeds for each environment
Experiments - MuJoCo environments
Training for 2M steps
The values are the average best performance over 10 random seeds
Recap DeepMind pysc2
The Network Architecture
Recap DeepMind pysc2
Discussion
