Reinforcement learning with
MarLÖ and Breakout
Ruben Romero, Lahiru Dulanjana Chamain Hewa Gamage,
Roderick Gu, Yuheng Li and Chen Zhou
Agenda
1. Introduction to Marlo
2. Introduction to Breakout
a. DQN
b. Duel-DQN
c. A2C
d. PPO
3. Results comparison
4. Marlo environment
5. Challenges
Game environment: MarLÖ - Find the Goal
• Multi-agent RL platform built on Minecraft.
• Goal: Find the box
• Actions
• Forward, Backward, Right, Left
• Turn-right, Turn-left
• Open problem
• No published/online solutions
Game environment: Breakout
• Platform: Atari
• Goal: break bricks
• Actions (see the snippet below)
• Move-Right
• Move-Left
• Do nothing
• Shoot
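A quick way to inspect this action set, as an illustrative snippet only (assuming the classic OpenAI Gym Atari API, which is not shown on the slide):

import gym

env = gym.make("Breakout-v4")
print(env.action_space)                     # Discrete(4)
print(env.unwrapped.get_action_meanings())  # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']

obs = env.reset()
obs, reward, done, info = env.step(1)       # FIRE ("shoot") launches the ball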
From Marlo to Breakout
Marlo has a comparatively complex environment:
○ 3D objects
■ the agent cannot see the whole environment at any given time
○ Redundant actions
■ e.g. going backward
○ Sparse reward
■ the agent only receives a reward when it reaches the goal
○ Long resetting time
■ resetting the environment takes several seconds
○ Asynchronous environment
■ difficult to train in parallel
Q-Learning
Bellman Equation:
Iterative Bellman Equation:
Discounted future reward:
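In standard notation (reward r_t at step t, discount factor \gamma \in [0,1], optimal action-value function Q^*), these quantities are usually written as:

Discounted future reward:  R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

Bellman equation:  Q^*(s,a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s',a') \;\big|\; s,a \,\big]

Iterative Bellman equation:  Q_{i+1}(s,a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q_i(s',a') \;\big|\; s,a \,\big]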
Method 1: DQN
Q-function approximator:
Target for iteration i:
DQN loss function:
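In the notation of the DQN paper (network weights \theta_i at iteration i), these are:

Q-function approximator:  Q(s,a;\theta) \approx Q^*(s,a)

Target for iteration i:  y_i = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \;\big|\; s,a \,\big]

DQN loss function:  L_i(\theta_i) = \mathbb{E}_{s,a}\big[ \big( y_i - Q(s,a;\theta_i) \big)^2 \big]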
Exploit vs Explore
• Problem: The algorithm does not explore all the possible states.
• Solution:
• Use a parameter (epsilon) to decide when to take a random action instead of
following the Q-values.
• This helps to explore all the possible states.
import random

epsilon = exploration_factor             # exploration probability (annealed over training)
if random.random() < epsilon:
    action = choose_random_action()      # exploration: random action
else:
    action = choose_action_max_q()       # exploitation: action with the highest Q-value
Experience replay
• Problem: Training samples are highly correlated, which makes convergence harder.
• Solution:
• Store the sampled transitions (s, a, r, s') in a replay buffer.
• Train on random minibatches drawn from the replay buffer.
• These samples are more independent of each other, which speeds up training
(a minimal sketch follows below).
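A minimal replay-buffer sketch in Python (an illustration, not the project's code; the capacity and batch size are made-up defaults):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random, decorrelated samples
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done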
Separate target network
• Problem: The target is updated at every iteration, which leads to unstable learning.
• Solution:
• Add a second neural network with the same structure.
• The target network provides the Q-value targets, while the online network receives
all the gradient updates during training.
• Every N iterations the two networks are synchronized (see the sketch below).
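A minimal sketch of the two-network setup (assuming PyTorch; the Linear layer is only a stand-in for the real Q-network):

import copy
import torch.nn as nn

online_net = nn.Linear(4, 2)               # placeholder Q-network, trained every step
target_net = copy.deepcopy(online_net)     # second network with the same structure

SYNC_EVERY = 1000                          # the "N" in "every N iterations"
for step in range(100000):
    # ... compute targets with target_net, take a gradient step on online_net ...
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())   # synchronize the two networks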
DQN: Breakout: Epsilon Greedy
Average test score: 33
Duel-DQN
[Network diagram: an action-advantage stream and a state-value stream combined into Q-values]
Advantage function:
How good it is to be in state s:
Value of choosing action a in state s:
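The three quantities above, in the standard dueling-network notation (Wang et al., 2016):

How good it is to be in state s:  V(s) = \mathbb{E}_{a \sim \pi}\big[ Q(s,a) \big]

Value of choosing action a in state s:  Q(s,a)

Advantage function:  A(s,a) = Q(s,a) - V(s)

Aggregation used by the dueling network:  Q(s,a) = V(s) + \Big( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \Big)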
Duel-DQN: Breakout: Epsilon Greedy
Average test score: 41
DQN vs. Duel-DQN
A2C - Advantage Actor-Critic
The agent receives reward r at time step t and the game continues for T steps; total reward:
The agent finishes the episode along a trajectory with probability p; expected total reward:
Policy-based: directly optimize the policy to maximize the expected total reward.
Algo 1: directly take the gradient of the expected total reward:
Algo 2: use a critic to evaluate the actions that the policy takes:
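Written out in the standard notation (trajectory \tau, policy \pi_\theta; the symbols are the usual ones rather than a transcript of the slide):

Total reward of an episode of length T:  R(\tau) = \sum_{t=1}^{T} r_t

Expected total reward:  J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau) \big] = \sum_{\tau} p_\theta(\tau)\, R(\tau)

Algo 1 (policy gradient):  \nabla_\theta J(\theta) = \mathbb{E}_{\tau}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]

Algo 2 (actor-critic): replace R(\tau) with the critic's advantage estimate A(s_t,a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t):
\nabla_\theta J(\theta) \approx \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t,a_t) \big]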
A2C: the policy interacts with multiple environments in a synchronous way.
Our architecture (sketched below):
CNN head: AlexNet (last FC layer removed)
RNN: 1-layer GRU
Policy head: FC layer that outputs the (greedy) action
Value head: FC layer that outputs the state value V
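A PyTorch sketch of this architecture (the torchvision AlexNet, layer sizes, and input shape are assumptions; the project's exact code may differ):

import torch.nn as nn
from torchvision import models

class A2CNet(nn.Module):
    def __init__(self, num_actions, hidden=512):
        super().__init__()
        alexnet = models.alexnet()                    # randomly initialized AlexNet
        # CNN head: AlexNet with its last FC layer removed (output dim 4096)
        self.cnn = nn.Sequential(
            alexnet.features, alexnet.avgpool, nn.Flatten(),
            *list(alexnet.classifier.children())[:-1],
        )
        self.gru = nn.GRU(input_size=4096, hidden_size=hidden, num_layers=1)  # 1-layer GRU
        self.policy = nn.Linear(hidden, num_actions)  # policy head: action logits (greedy = argmax)
        self.value = nn.Linear(hidden, 1)             # value head: state value V(s)

    def forward(self, obs, h=None):
        # obs: (batch, 3, 224, 224) frames; h: GRU hidden state (None = zeros)
        x = self.cnn(obs)
        x, h = self.gru(x.unsqueeze(0), h)            # sequence length 1
        x = x.squeeze(0)
        return self.policy(x), self.value(x), h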
Results
[Training curves for Breakout, Pong, and Seaquest]
Time comparison

        total steps    rewards    time
DQN     1.75M          40         10h
A2C     20M            170        6h

[Plots: DQN vs. A2C]
PPO - Proximal Policy Optimization
[Network structure of Proximal Policy Optimization]
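For reference, the clipped surrogate objective that PPO optimizes (Schulman et al., 2017), with probability ratio r_t(\theta) and advantage estimate \hat{A}_t:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big]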
Max reward change with training episodes
[Plot: reward vs. steps]
Marlo: Duel-DQN
• Agent couldn’t learn with Duel-DQN
• Possible reasons:
• Sparsity of the reward
• In 90% of the episodes the agent doesn't reach the goal
• Redundant Actions
• Going backward
Marlo: A2C
1. A separate rollout buffer for each worker, with shared memory
2. Set a timer
3. Generate fake data while the environment is resetting
4. Dynamic rollout length
5. Reuse rollout data repeatedly while the environment is resetting
Challenges
• The agent can move backward so that the goal is no longer in its observation; in that
case the observation is not correlated with the reward.
• Training time:
at least 12 hours per agent for each algorithm
• Limited GPU resources:
2 GPUs: GTX 1080; GTX 1080 Ti
• Parallel environments:
efficient, but took too much time to implement
Conclusion
1. Agent can learn well in Breakout with
a. DQN, Duel-DQN
b. A2C
2. Duel-DQN performs better than DQN
3. A2C is much faster, benefiting from the parallel environments
4. Marlo has a comparatively complex environment
5. Agent couldn’t learn in Marlo with
a. Duel-DQN
b. A2C
Contribution
Ruben Romero: DQN
Lahiru Dulanjana Chamain Hewa Gamage: Duel-DQN
Jing Gu: PPO
Yuheng Li: A2C, Parallel Env
Chen Zhou: A2C
THANK YOU !