#MinecraftRL @choas
Minecraft and Reinforcement Learning
Lars Gregori
@choas
labs.hybris.com
?
?
?
?
Minecraft
Minecraft
Markus "Notch" Persson
Mojang AB
Best-selling PC game of all time
Exploration
Resource gathering
Crafting
Combat
Sandbox construction game
Creative + building aspects
Three-dimensional environment
Project Malmo
Project Malmo
Open Source (GitHub)
Microsoft Research Lab
Based on
Minecraft / Minecraft Forge
Agents written in
Python, Lua, C++, C#,
Java, Torch, ALE*
Mission XML
WorldState
Send Command
*Arcade Learning Environment
“The Project Malmo platform is designed to support a wide range of
experimentation needs and can support research in robotics, computer
vision, reinforcement learning, planning, multi-agent systems, and
related areas”
The Malmo Platform for Artificial Intelligence Experimentation,
Proc. 25th International Joint Conference on Artificial Intelligence
Project Malmo
Reinforcement Learning
Reinforcement Learning
Supervised
Learning
Unsupervised
Learning
Reinforcement
Learning
Reinforcement Learning
Observation Reward Action
Environment
Agent
“Reinforcement learning is like trial-and-error learning.”
David Silver
Reinforcement Learning
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto

(1998)
Reinforcement Learning
Cliff Walking Example
Reward:
-1 per move
+100 blue (goal) field
-100 lava field
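Stated as code, this scheme makes a goal-reaching move worth -1 + 100 = 99, which is the 99.0 that shows up in the Q-Learning walkthrough later. A minimal sketch (the cell names are made up for illustration; treating lava the same way gives -101 for a move into lava, since the slide's -100 is the field bonus alone):

```python
def move_reward(cell):
    """Cliff-walking reward: -1 per move plus the field bonus."""
    bonus = {"blue": 100, "lava": -100}.get(cell, 0)  # "blue" = goal field
    return -1 + bonus
```

For example, `move_reward("blue")` is 99 while a plain move costs `move_reward("floor")` = -1.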
Reinforcement Learning Demo
Q-Learning
Q-Learning
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = q_table[prev_state][prev_action]
max_q = max(q_table[current_state][:])
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
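Wrapped in a function, the update above can be applied one transition at a time. Here `q_table` is assumed to be a plain dict mapping each state to a list of four action values, a stand-in for the slide's table rather than Project Malmo code:

```python
ALPHA = 1.0  # step-size parameter
GAMMA = 0.8  # discount-rate parameter

def q_update(q_table, prev_state, prev_action, reward, current_state):
    """One Q-learning backup: Q(s,a) += ALPHA * (r + GAMMA * max_a' Q(s',a') - Q(s,a))."""
    old_q = q_table[prev_state][prev_action]
    max_q = max(q_table[current_state])
    q_table[prev_state][prev_action] = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
    return q_table[prev_state][prev_action]
```

With an empty table, a backup that earns reward 99.0 stores exactly 99.0, matching the worked example on the following slides.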
Q-Learning
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = 0.0
max_q = max(q_table[current_state][:])
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
Q-Learning
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = 0.0
max_q = 0.0
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
Q-Learning
100 

-1
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = 0.0
max_q = 0.0
new_q = old_q + ALPHA * (99.0 + GAMMA * max_q - old_q)
Q-Learning
100 

-1
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = 0.0
max_q = 0.0
new_q = 0.0 + 1.0 * (99.0 + 0.8 * 0.0 - 0.0)
Q-Learning
100 

-1
99.0
ALPHA = 1.0 ### step-size parameter
GAMMA = 0.8 ### discount-rate parameter
old_q = 0.0
max_q = 0.0
new_q = 99.0
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = q_table[prev_state][prev_action]
max_q = max(q_table[current_state][:])
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = max(q_table[current_state][:])
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = old_q + ALPHA * (reward + GAMMA * max_q - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = old_q + ALPHA * (-1.0 + GAMMA * max_q - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = old_q + ALPHA * (-1.0 + 0.8 * 99.0 - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = old_q + ALPHA * (-1.0 + 79.2 - old_q)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = -1.0 + 1.0 * (-1.0 + 79.2 - -1.0)
Q-Learning
100 

-1
99.0
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = -1.0 + 1.0 * (-1.0 + 79.2 + 1.0)
Q-Learning
100 

-1
99.0
78.2
ALPHA = 1.0
GAMMA = 0.8
old_q = -1.0
max_q = 99.0
new_q = 78.2
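The two substitutions worked through above can be checked directly; the first backup (empty table, reward 99.0) yields 99.0, and the second (old_q = -1.0, best neighbour worth 99.0) yields 78.2:

```python
ALPHA, GAMMA = 1.0, 0.8

# First backup: empty table, reward 99.0.
first = 0.0 + ALPHA * (99.0 + GAMMA * 0.0 - 0.0)

# Second backup: the move costs -1, the best neighbour value is 99.0.
old_q, max_q = -1.0, 99.0
second = old_q + ALPHA * (-1.0 + GAMMA * max_q - old_q)

print(first, round(second, 6))  # 99.0 78.2
```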
[99 0 0 0] [ 0 -1 -1 0] [ 0 0 L 0]
[ L -1 -1 -1] [-1 -1 -1 -1] [-1 0 0 0]
[ L -1 -1 -1] [-1 -1 -1 -1] [-1 L 0 0]
[ L L -2 -1] [-2 -2 L -1]
[ L -2 -2 -2] [-2 -2 L L]
[ L -3 -2 L] [-2 -3 -2 -2] [-2 -3 L -2]
[ L L -3 L] [-3 L -3 -3] [-3 L -3 -3] [-2 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [ 0 -1 -1 0] [ 0 0 L 0]
[ L -1 -1 78] [-1 -1 -1 -1] [-1 0 0 0]
[ L -1 -1 -1] [-1 -1 -1 -1] [-1 L 0 0]
[ L L -2 -1] [-2 -2 L -1]
[ L -2 -2 -2] [-2 -2 L L]
[ L -3 -2 L] [-2 -3 -2 -2] [-2 -3 L -2]
[ L L -3 L] [-3 L -3 -3] [-3 L -3 -3] [-2 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [ 0 -1 -1 0] [ 0 0 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [-1 -1 L -1]
[ L -2 -2 61] [-2 -1 -1 -1] [-1 L L -1]
[ L L -2 -2] [-2 -3 L -2]
[ L -2 -3 -2] [-3 -2 L L]
[ L -3 -3 L] [-3 -3 -3 -3] [-2 -3 L -3]
[ L L -3 L] [-3 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [ 0 -1 -1 0] [ 0 0 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [-1 -1 L -1]
[ L -2 -2 61] [-2 -1 -1 -1] [-1 L L -1]
[ L L -2 48] [-2 -3 L -2]
[ L -2 -3 -2] [-3 -2 L L]
[ L -3 -3 L] [-3 -3 -3 -3] [-3 -3 L -3]
[ L L -3 L] [-3 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 -2] [-3 -3 L L]
[ L -3 -3 L] [-3 -3 -3 -3] [-3 -3 L -3]
[ L L -3 L] [-3 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 -3 L] [-3 -3 -3 -3] [-3 -3 L -3]
[ L L -4 L] [-3 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 -3 L] [-3 -3 -3 22] [-3 -3 L -3]
[ L L -4 L] [-3 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 16 L] [-3 -3 -3 22] [-3 -3 L -3]
[ L L -4 L] [-4 L -3 -3] [-3 L -3 -3] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 16 L] [-3 -3 -3 22] [-3 -3 L -3]
[ L L -4 L] [-4 L -3 12] [-3 L -3 16] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 16 L] [-3 -3 -3 22] [-3 -3 L -3]
[ L L 8 L] [-4 L -3 12] [-3 L -3 16] [-3 L L -3]
Q Table
L = Lava
[ ← ↓ → ↑ ]
[99 0 0 0] [78 -1 -1 0] [-1 -1 L -1]
[ L -1 -1 78] [61 -1 -1 -1] [48 -1 L -1]
[ L -2 -2 61] [-2 -2 -2 48] [-1 L L -1]
[ L L -2 48] [-2 -3 L 37]
[ L -3 -3 29] [-3 -3 L L]
[ L -4 16 L] [-3 -3 -3 22] [-3 -3 L -3]
[ L L 8 L] [-4 L -3 12] [-3 L -3 16] [-3 L L -3]
ALPHA = 1.0 GAMMA = 0.8
[99 48 0 L] [48 0 0 0] [-1 0 L 0]
[ L 0 -1 97] [96 -1 -1 -1] [-1 -1 L -1]
[ L -1 -1 -1] [-1 -1 -1 92] [-1 L L -1]
[ L L -2 -1] [-2 -2 L 83]
[ L -3 -3 74] [-2 -4 L L]
[ L -5 -2 L] [-4 -4 -4 55] [-4 -4 L -4]
[ L L -1 L] [-6 L 11 -5] [-5 L -5 31] [-5 L L -4]
ALPHA = 0.5 GAMMA = 1.0 (40 moves)
[99 48 0 L] [48 0 0 0] [-1 0 L 0]
[ L 0 -1 97] [96 -1 -1 -1] [-1 -1 L -1]
[ L -1 -1 47] [-2 -1 -1 95] [-1 L L -1]
[ L L -2 -1] [-2 45 L 94]
[ L -3 -3 93] [-2 -4 L L]
[ L -5 -2 L] [-4 -4 -4 92] [-4 -4 L -4]
[ L L 88 L] [-6 L 90 -5] [-5 L -5 91] [-5 L L -4]
ALPHA = 0.5 GAMMA = 1.0 (60 moves)
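Putting the pieces together: the sketch below runs tabular Q-learning with the update rule from the slides on a tiny made-up grid (three rows, two columns — not the slide's layout) using the ALPHA = 0.5, GAMMA = 1.0 setting shown above. After a few hundred episodes the greedy policy from the start reaches the goal while avoiding the lava column:

```python
import random

random.seed(0)

GRID = ["G.",   # goal (+100 bonus)
        "L.",   # lava (-100 bonus)
        "S."]   # start
ROWS, COLS = len(GRID), len(GRID[0])
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
ALPHA, GAMMA, EPSILON = 0.5, 1.0, 0.2

def step(pos, a):
    """Move (clamped at the walls); reward is -1 per move plus the field bonus."""
    r = max(0, min(ROWS - 1, pos[0] + ACTIONS[a][0]))
    c = max(0, min(COLS - 1, pos[1] + ACTIONS[a][1]))
    cell = GRID[r][c]
    reward = -1 + (100 if cell == "G" else -100 if cell == "L" else 0)
    return (r, c), reward, cell in "GL"

q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}

for _ in range(500):                    # episodes
    pos = (2, 0)                        # 'S'
    for _ in range(100):                # step cap keeps every episode finite
        if random.random() < EPSILON:
            a = random.randrange(4)     # explore
        else:
            a = max(range(4), key=lambda i: q[pos][i])  # exploit
        nxt, reward, done = step(pos, a)
        max_q = 0.0 if done else max(q[nxt])
        q[pos][a] += ALPHA * (reward + GAMMA * max_q - q[pos][a])
        pos = nxt
        if done:
            break

# Follow the learned greedy policy from the start.
pos = (2, 0)
for _ in range(10):
    pos, _, done = step(pos, max(range(4), key=lambda i: q[pos][i]))
    if done:
        break
print(GRID[pos[0]][pos[1]])
```

The step cap and the epsilon-greedy choice are additions not shown on the slides; without some exploration the agent can get stuck repeating its first lucky guess.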
!
!
!
!
?
Deep Reinforcement Learning
Deep Reinforcement Learning
Supervised
Learning
Unsupervised
Learning
Reinforcement
Learning
Playing Atari with Deep Reinforcement Learning (arXiv:1312.5602)
https://youtu.be/TmPfTpjtdgg
12 Classes
### based on arXiv:1312.5602 (page 6)
from keras.models import Sequential
from keras.layers import Conv2D, Activation, Flatten, Dense

model = Sequential()
model.add(Conv2D(16, (8, 8), strides=(4, 4), input_shape=input_shape))
model.add(Activation('relu'))
model.add(Conv2D(32, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(12, activation='sigmoid'))  # 12 classes / actions
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Keras Model
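At run time the network's 12 outputs have to be mapped back to one of the 12 actions; an epsilon-greedy pick over the predicted scores is the usual choice. A sketch (the `scores` list stands in for a `model.predict(...)` result; the function name is illustrative):

```python
import random

def pick_action(scores, epsilon=0.1):
    """Epsilon-greedy: usually take the best-scoring action, sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(len(scores))                 # explore
    return max(range(len(scores)), key=lambda i: scores[i])  # exploit

# With exploration off, the pick is deterministic:
best = pick_action([0.1] * 11 + [0.9], epsilon=0.0)  # index 11
```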
Deep Reinforcement Learning Demo
Take-away
Links
The Malmo Platform for Artificial Intelligence Experimentation. Proc. 25th International Joint
Conference on Artificial Intelligence http://www.ijcai.org/Proceedings/2016
Project Malmo https://www.microsoft.com/en-us/research/project/project-malmo/
Project Malmo (GitHub) https://github.com/Microsoft/malmo
Reinforcement Learning: An Introduction - ISBN-13: 978-0262193986

2nd edition available online


YouTube RL Course by David Silver
Thank you. Hi Lars …
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company.
The information contained herein may be changed without prior notice. Some software products marketed by SAP SE and its distributors contain proprietary software components 

of other software vendors. National product specifications may vary.
These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated
companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are 

set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.
In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release
any functionality mentioned therein. This document, or any related presentation, and SAP SE’s or its affiliated companies’ strategy and possible future developments, products, and/
or platform directions and functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information
in this document is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. All forward-looking statements are subject to various 

risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements,
and they should not be relied upon in making purchasing decisions.
SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) 

in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies. 

See http://global.sap.com/corporate-en/legal/copyright/index.epx for additional trademark information and notices.
© 2018 SAP SE or an SAP affiliate company. All rights reserved.
