Reinforcement learning with
MarLÖ and Breakout
Ruben Romero, Lahiru Dulanjana Chamain Hewa Gamage,
Roderick Gu, Yuheng Li and Chen Zhou
Agenda
1. Introduction to Marlo
2. Introduction to Breakout
a. DQN
b. Duel-DQN
c. A2C
d. PPO
3. Results comparison
4. Marlo environment
5. Challenges
Game environment: MarLÖ - Find the Goal
• Multi-agent RL platform built on Minecraft.
• Goal: Find the box
• Actions
• Forward, Backward, Right, Left
• Turn-right, Turn-left
• Open problem
• No published/online solutions
Game environment: Breakout
• Platform: Atari
• Goal: break bricks
• Actions (see the snippet below)
• Move-Right
• Move-Left
• Do nothing
• Shoot
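A quick way to inspect this action set, as an illustrative snippet only (assuming the classic OpenAI Gym Atari API, which is not shown on the slide):

import gym

env = gym.make("Breakout-v4")
print(env.action_space)                     # Discrete(4)
print(env.unwrapped.get_action_meanings())  # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']

obs = env.reset()
obs, reward, done, info = env.step(1)       # FIRE ("shoot") launches the ball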
From Marlo to Breakout
Marlo has a comparatively complex environment:
○ 3D objects
■ the agent cannot see the whole environment at any given time
○ Redundant actions
■ e.g. going backward
○ Sparse reward
■ the agent only receives a reward when it reaches the goal
○ Long resetting time
■ resetting the environment takes several seconds
○ Asynchronous environment
■ difficult to train in parallel
Q-Learning
Bellman Equation:
Iterative Bellman Equation:
Discounted future reward:
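In standard notation (reward r_t at step t, discount factor \gamma \in [0,1], optimal action-value function Q^*), these quantities are usually written as:

Discounted future reward:  R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

Bellman equation:  Q^*(s,a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s',a') \;\big|\; s,a \,\big]

Iterative Bellman equation:  Q_{i+1}(s,a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q_i(s',a') \;\big|\; s,a \,\big]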
Method 1: DQN
Q-function approximator:
Target for iteration i:
DQN loss function:
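In the notation of the DQN paper (network weights \theta_i at iteration i), these are:

Q-function approximator:  Q(s,a;\theta) \approx Q^*(s,a)

Target for iteration i:  y_i = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \;\big|\; s,a \,\big]

DQN loss function:  L_i(\theta_i) = \mathbb{E}_{s,a}\big[ \big( y_i - Q(s,a;\theta_i) \big)^2 \big]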
Exploit vs Explore
• Problem: The algorithm does not explore all the possible states.
• Solution:
• Use a parameter (epsilon) to decide when to take a random action instead of
following the Q-values.
• This helps to explore all the possible states.
import random

epsilon = exploration_factor             # exploration probability (annealed over training)
if random.random() < epsilon:
    action = choose_random_action()      # exploration: random action
else:
    action = choose_action_max_q()       # exploitation: action with the highest Q-value
Experience replay
• Problem: Training samples are highly correlated, which makes convergence harder.
• Solution:
• Store the sampled transitions (s, a, r, s') in a replay buffer.
• Train on random minibatches drawn from the replay buffer.
• These samples are more independent of each other, which speeds up training
(a minimal sketch follows below).
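A minimal replay-buffer sketch in Python (an illustration, not the project's code; the capacity and batch size are made-up defaults):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random, decorrelated samples
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done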
Separate target network
• Problem: The target is updated at every iteration, which leads to unstable learning.
• Solution:
• Add a second neural network with the same structure.
• The target network provides the Q-value targets, while the online network receives
all the gradient updates during training.
• Every N iterations the two networks are synchronized (see the sketch below).
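A minimal sketch of the two-network setup (assuming PyTorch; the Linear layer is only a stand-in for the real Q-network):

import copy
import torch.nn as nn

online_net = nn.Linear(4, 2)               # placeholder Q-network, trained every step
target_net = copy.deepcopy(online_net)     # second network with the same structure

SYNC_EVERY = 1000                          # the "N" in "every N iterations"
for step in range(100000):
    # ... compute targets with target_net, take a gradient step on online_net ...
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())   # synchronize the two networks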
DQN: Breakout: Epsilon Greedy
Average test score: 33
Duel-DQN
[Network diagram: an action-advantage stream and a state-value stream combined into Q-values]
Advantage function:
How good it is to be in state s:
Value of choosing action a in state s:
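The three quantities above, in the standard dueling-network notation (Wang et al., 2016):

How good it is to be in state s:  V(s) = \mathbb{E}_{a \sim \pi}\big[ Q(s,a) \big]

Value of choosing action a in state s:  Q(s,a)

Advantage function:  A(s,a) = Q(s,a) - V(s)

Aggregation used by the dueling network:  Q(s,a) = V(s) + \Big( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \Big)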
Duel-DQN: Breakout: Epsilon Greedy
Average test score: 41
DQN vs. Duel-DQN
A2C - Advantage Actor-Critic
The agent receives reward r at time step t and the game continues for T steps; total reward:
The agent finishes the episode along a trajectory with probability p; expected total reward:
Policy-based: directly optimize the policy to maximize the expected total reward.
Algo 1: directly take the gradient of the expected total reward:
Algo 2: use a critic to evaluate the actions that the policy takes:
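Written out in the standard notation (trajectory \tau, policy \pi_\theta; the symbols are the usual ones rather than a transcript of the slide):

Total reward of an episode of length T:  R(\tau) = \sum_{t=1}^{T} r_t

Expected total reward:  J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau) \big] = \sum_{\tau} p_\theta(\tau)\, R(\tau)

Algo 1 (policy gradient):  \nabla_\theta J(\theta) = \mathbb{E}_{\tau}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]

Algo 2 (actor-critic): replace R(\tau) with the critic's advantage estimate A(s_t,a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t):
\nabla_\theta J(\theta) \approx \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t,a_t) \big]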
A2C: the policy interacts with multiple environments in a synchronous way.
Our architecture (sketched below):
CNN head: AlexNet (last FC layer removed)
RNN: 1-layer GRU
Policy head: FC layer that outputs the (greedy) action
Value head: FC layer that outputs the state value V
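A PyTorch sketch of this architecture (the torchvision AlexNet, layer sizes, and input shape are assumptions; the project's exact code may differ):

import torch.nn as nn
from torchvision import models

class A2CNet(nn.Module):
    def __init__(self, num_actions, hidden=512):
        super().__init__()
        alexnet = models.alexnet()                    # randomly initialized AlexNet
        # CNN head: AlexNet with its last FC layer removed (output dim 4096)
        self.cnn = nn.Sequential(
            alexnet.features, alexnet.avgpool, nn.Flatten(),
            *list(alexnet.classifier.children())[:-1],
        )
        self.gru = nn.GRU(input_size=4096, hidden_size=hidden, num_layers=1)  # 1-layer GRU
        self.policy = nn.Linear(hidden, num_actions)  # policy head: action logits (greedy = argmax)
        self.value = nn.Linear(hidden, 1)             # value head: state value V(s)

    def forward(self, obs, h=None):
        # obs: (batch, 3, 224, 224) frames; h: GRU hidden state (None = zeros)
        x = self.cnn(obs)
        x, h = self.gru(x.unsqueeze(0), h)            # sequence length 1
        x = x.squeeze(0)
        return self.policy(x), self.value(x), h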
Results
[Training curves for Breakout, Pong, and Seaquest]
Time comparison

        total steps    rewards    time
DQN     1.75M          40         10h
A2C     20M            170        6h

[Plots: DQN vs. A2C]
PPO - Proximal Policy Optimization
[Network structure of Proximal Policy Optimization]
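For reference, the clipped surrogate objective that PPO optimizes (Schulman et al., 2017), with probability ratio r_t(\theta) and advantage estimate \hat{A}_t:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big]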
Max reward change with training episodes
[Plot: reward vs. steps]
Marlo: Duel-DQN
• Agent couldn’t learn with Duel-DQN
• Possible reasons:
• Sparsity of the reward
• In 90% of the episodes the agent doesn't reach the goal
• Redundant Actions
• Going backward
Marlo: A2C
1. A separate rollout buffer for each worker, with shared memory
2. Set a timer
3. Generate fake data while the environment is resetting
4. Dynamic rollout length
5. Reuse rollout data repeatedly while the environment is resetting
Challenges
• The agent can move backward so that the goal is no longer in its observation; in that
case the observation is not correlated with the reward.
• Training time:
at least 12 hours per agent for each algorithm
• Limited GPU resources:
2 GPUs: GTX 1080; GTX 1080 Ti
• Parallel environments:
efficient, but took too much time to implement
Conclusion
1. Agent can learn well in Breakout with
a. DQN, Duel-DQN
b. A2C
2. Duel-DQN performs better than DQN
3. A2C is much faster, benefiting from the parallel environments
4. Marlo has a comparatively complex environment
5. Agent couldn’t learn in Marlo with
a. Duel-DQN
b. A2C
Contribution
Ruben Romero: DQN
Lahiru Dulanjana Chamain Hewa Gamage: Duel-DQN
Jing Gu: PPO
Yuheng Li: A2C, Parallel Env
Chen Zhou: A2C
THANK YOU !