Discrete sequential prediction of continuous actions for deep RL
1. Discrete Sequential Prediction of Continuous Actions for Deep RL
Luke Metz∗, Julian Ibarz, James Davidson - Google Brain
Navdeep Jaitly - NVIDIA Research
presented by Jie-Han Chen
(under review as a conference paper at ICLR 2018)
2. My current challenge
The action space of pysc2 is complicated:
different types of actions require different parameters.
9. Introduction - Discretized continuous action
If we want to use a discrete-action
method to solve a continuous-action
problem, we need to discretize the
continuous action values.
10. Introduction - Discretized continuous action
If we split a continuous angle into 3.6° bins,
then even for a 1-D action there are 100
output neurons!
(figure: an angle axis with bins marked at 0°, 3.6°, 7.2°, …)
11. Introduction - Discretized continuous action
For a 2-D action, we need
10,000 neurons (100 × 100) to cover all
combinations!
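A minimal sketch of this blow-up (the 360° range and 3.6° bin size follow the slides; everything else is illustrative):

```python
import numpy as np

# 1-D: bin a continuous angle into 3.6-degree steps -> 100 discrete actions.
angle_bins = np.arange(0.0, 360.0, 3.6)      # [0.0, 3.6, 7.2, ...]
assert angle_bins.size == 100                # 100 output neurons for 1-D

# Naive joint discretization needs one output per *combination* of bins:
n_dims = 2
print(angle_bins.size ** n_dims)             # 10000 neurons for 2-D
# In general 100**N outputs for an N-D action: exponential blow-up.
```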
12. Introduction
In this paper, they focus on:
● Off-policy algorithms
● Value-based methods (which usually cannot solve continuous-action problems)
They want to make DQN able to solve continuous-action problems.
13. Method
● Inspired by sequence-to-sequence models
● They call this method SDQN (S for sequential)
● Output one 1-D action dimension per step
○ reduces N-D action selection to a series of 1-D selection problems (see the sketch after this list)
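A minimal sketch of that reduction, assuming a callable `inner_q(state, prefix)` that returns discretized Q-values for the next action dimension given the bins chosen so far (the name and interface are illustrative, not the paper's code):

```python
import numpy as np

def select_action(inner_q, state, n_dims):
    """SDQN-style greedy selection: choose one discretized dimension at a
    time, conditioning each choice on the dimensions already chosen."""
    prefix = []
    for _ in range(n_dims):
        q_values = inner_q(state, prefix)      # shape: (n_bins,)
        prefix.append(int(np.argmax(q_values)))
    return prefix                              # one bin index per dimension
```

With N dimensions and B bins this needs only N heads of B outputs each (N·B values in total), instead of the B^N outputs of naive joint discretization.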
18. Define agent-environment boundary
Before defining the set of states, we should define the
boundary between the agent and the environment.
According to Richard Sutton’s textbook:
1. “The agent-environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge.”
2. “The general rule we follow is that anything that
cannot be changed arbitrarily by the agent is
considered to be outside of it and thus part of its
environment.”
23. Method - Training
There are 3 update stages:
● Outer MDP (s_t → s_t+1)
● The inner Qs are updated by Q-learning
● The last inner Q needs to match Q(s_t+1)
28. Method - Training
2 kinds of neural networks (there may not actually be 2, depending on the implementation)
● The outer Q network, denoted Q(s, a)
● The inner Q networks, denoted Q_i
○ Q_i gives the action values for the i-th dimension, given the dimensions chosen so far
○ the last inner Q is denoted Q_N (for an N-D action)
30. Method - Training
Update the outer Q network
● The outer Q network is denoted Q(s, a)
○ only used to evaluate state-action values, not to select actions
○ updated by the Bellman equation: Q(s_t, a_t) ← r_t + γ Q(s_t+1, a*), with a* the greedy action chosen by the inner Q networks
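A hedged PyTorch sketch of that update; `outer_q` and the greedy `next_action` produced by the inner networks are stand-ins, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def outer_q_loss(outer_q, state, action, reward, next_state, next_action,
                 done, gamma=0.99):
    """One-step Bellman regression for the outer network Q(s, a).
    `next_action` is the greedy action picked sequentially by the inner
    Q networks on `next_state` (see the selection sketch earlier)."""
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * outer_q(next_state, next_action)
    return F.mse_loss(outer_q(state, action), target)
```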
31. Method - Training
Update the inner Q networks
● Each inner Q network Q_i
○ gives the action values for the i-th dimension
○ updated by Q-learning on the zero-reward inner transitions: Q_i(s, a_1..i) ← max over a_i+1 of Q_i+1(s, a_1..i+1)
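A sketch of this inner update under the same assumed `inner_q(state, prefix)` interface; since inner transitions carry no reward, the target is a pure max over the next dimension:

```python
import torch

def inner_q_loss(inner_q, state, action, i):
    """Inner Bellman consistency for dimension i (0-based, i < N - 1):
    the value of the chosen bin at dimension i should equal the best
    value attainable at dimension i + 1."""
    with torch.no_grad():
        target = inner_q(state, action[:i + 1]).max()     # best next-dim value
    predicted = inner_q(state, action[:i])[action[i]]     # chosen bin's value
    return (predicted - target) ** 2
```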
32. Method - Training
Update the last inner Q network
● The last inner Q network is denoted Q_N
○ updated by regression onto the outer Q: Q_N(s, a) ≈ Q(s, a)
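A sketch of that regression step, again with assumed interfaces: the fully specified action's value under the last inner network is pulled toward the outer network's estimate:

```python
import torch

def last_inner_q_loss(inner_q, outer_q, state, action):
    """Regress the last inner Q onto the outer Q(s, a), which is treated
    as the target (no gradient flows into the outer network here)."""
    with torch.no_grad():
        target = outer_q(state, action)
    predicted = inner_q(state, action[:-1])[action[-1]]
    return (predicted - target) ** 2
```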
33. Implementation of the inner Q networks
1. Recurrent with shared weights, using an LSTM
a. input: state + the previously selected action dimension (NOT the whole prefix of selected dimensions)
2. Multiple separate feedforward models
a. input: state + the concatenation of all previously selected action dimensions
b. more stable than the recurrent variant (sketches of both follow)
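A rough PyTorch sketch of the two variants; layer sizes, the one-hot action encoding, and module names are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RecurrentInnerQ(nn.Module):
    """Variant 1: one weight-shared LSTM; at each dimension it sees the
    state plus only the previously selected bin (one-hot), with earlier
    choices carried in the hidden state."""
    def __init__(self, state_dim, n_bins, hidden=128):
        super().__init__()
        self.cell = nn.LSTMCell(state_dim + n_bins, hidden)
        self.head = nn.Linear(hidden, n_bins)

    def forward(self, state, prev_bin_onehot, hc=None):
        hc = self.cell(torch.cat([state, prev_bin_onehot], dim=-1), hc)
        return self.head(hc[0]), hc        # Q-values for next dim, new state

class FeedforwardInnerQ(nn.Module):
    """Variant 2: a separate MLP for dimension i; it sees the state plus
    the concatenated one-hots of all bins chosen so far (reported on the
    slide as the more stable variant)."""
    def __init__(self, state_dim, n_bins, i, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + i * n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins))

    def forward(self, state, prefix_onehots):
        return self.net(torch.cat([state] + prefix_onehots, dim=-1))
```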
35. Experiments
● Multimodal Example Environment
○ compare with other state-of-the-art models and test its effectiveness
○ DDPG: a state-of-the-art off-policy actor-critic algorithm
○ NAF: another value-based algorithm that can solve continuous-action problems
● MuJoCo environments
○ test SDQN on common continuous control tasks
○ 5 tasks
36. Experiments - Multimodal Example Environment
1. Single-step MDP
a. only 2 states: the initial state and the terminal state
2. Deterministic environment
a. fixed transitions
3. 2-D action space (2 continuous actions)
4. Multimodal reward function
a. used to test whether the algorithm converges to a local or the global optimum (a toy sketch follows)
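A toy version of such an environment; the slides don't give the exact reward surface, so the mixture-of-bumps reward below is a hypothetical stand-in with one global and one local optimum:

```python
import numpy as np

class MultimodalBandit:
    """Single-step, deterministic MDP with a 2-D continuous action.
    A learner that commits to the nearest reward mode gets stuck in the
    local optimum; only the global mode gives the full reward of 1.0."""
    MODES = [(np.array([-0.5, -0.5]), 0.6),   # local optimum
             (np.array([ 0.5,  0.5]), 1.0)]   # global optimum

    def reset(self):
        return 0                               # the single initial state

    def step(self, action):
        a = np.asarray(action, dtype=np.float64)
        reward = sum(h * np.exp(-np.sum((a - c) ** 2) / 0.02)
                     for c, h in self.MODES)
        return None, reward, True, {}          # terminal after one step
```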
41. Experiments - MuJoCo environments
● Perform a hyperparameter search and select the best configuration to
evaluate performance
● Run 10 random seeds for each environment