A brief overview of Reinforcement Learning applied to games

Thomas Paula
August 16, 2018 - #10 Porto Alegre Machine Learning Meetup

Who am I?
Thomas Paula
● Machine Learning Engineer and Researcher @HP
● MSc in Computer Science
● POA Machine Learning Meetup
● @tsp_thomas
● tsp.thomas@gmail.com

Why study games?
● Simple rules and deep concepts
● Some have been studied for hundreds or
thousands of years
● Encapsulate real world issues
● Games are fun :)
Source: David Silver, 2015

Agenda
● Introduction
○ Artificial Intelligence
○ Challenge for AI: beat humans in chess
● Reinforcement Learning
● Deep Reinforcement Learning
● Closing thoughts

Artificial Intelligence
Source: Deep Learning (Goodfellow, Bengio, Courville)

Artificial Intelligence
● “The effort to automate intellectual tasks
normally performed by humans”
● Born in the 1950s: people trying to make
computers think
● People used to believe human-level artificial
intelligence = hand-crafted set of rules
● 1950s to 1980s: Symbolic AI

Why is (was) chess challenging for computers?
Programming a Computer for Playing Chess
● Seminal paper of Claude Shannon in 1950
● Number of possible games (Shannon's estimate): ~10^120
○ Number of atoms in known universe
estimate: 10^78 to 10^82
● Pure brute force: impossible even for modern
computers

Why is (was) chess challenging for computers?
● Let’s take tic-tac-toe as an example
[Figure: partial game tree for tic-tac-toe]
Source: https://materiaalit.github.io/intro-to-ai-17/part2/
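To give a feel for how quickly a game tree grows, here is a minimal sketch that enumerates every complete game of tic-tac-toe by brute force. The function names (`winner`, `count_games`) are made up for this illustration and are not from the talk or the linked course material.

```python
# Brute-force enumeration of the tic-tac-toe game tree.
# The board is a list of 9 cells holding "X", "O" or " ".
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def count_games(board, player):
    """Count the distinct complete games reachable from this position."""
    if winner(board) is not None or " " not in board:
        return 1  # leaf of the game tree: win or full board
    total = 0
    for i in range(9):
        if board[i] == " ":
            board[i] = player
            total += count_games(board, "O" if player == "X" else "X")
            board[i] = " "  # undo the move before trying the next one
    return total


total_games = count_games([" "] * 9, "X")
print(total_games)  # 255168
```

Even this tiny game has a few hundred thousand leaves, which is why the same exhaustive approach is hopeless for chess.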

IBM Deep Blue
● Chess-playing computer developed by IBM
● Won first game against Garry Kasparov on 10 February 1996
● Approach based on Symbolic AI
○ Alpha-beta pruning search algorithm
○ Deep Blue executed it in parallel
● Deep Blue won the six-game rematch in May 1997, after which
Kasparov accused IBM of cheating
● Results
○ Deep Blue was retired
○ Stockfish
Source: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
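As a rough illustration of alpha-beta pruning, the sketch below applies the textbook algorithm to tic-tac-toe. It is not Deep Blue's implementation, which relied on custom hardware, parallel search, and a hand-tuned chess evaluation function; the names `winner` and `alphabeta` are chosen for this example only.

```python
import math

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def alphabeta(board, player, alpha=-math.inf, beta=math.inf):
    """Game value from X's point of view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    maximizing = player == "X"
    best = -math.inf if maximizing else math.inf
    for i in range(9):
        if board[i] != " ":
            continue
        board[i] = player
        value = alphabeta(board, "O" if maximizing else "X", alpha, beta)
        board[i] = " "
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # the opponent would never allow this line: prune it
    return best


print(alphabeta([" "] * 9, "X"))  # 0: perfect play ends in a draw
```

The pruning step skips branches that cannot change the result, so the search visits far fewer nodes than plain minimax while returning the same value.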

Go
Number of possible positions: ~10^170!
Branching factor: average is 250!

How can we solve Go?
Reinforcement Learning
to the rescue!

What is Reinforcement Learning?
● Trial and error (no supervisor)
● Feedback is delayed, not instantaneous
● Time matters (data is not i.i.d.)
● Actions affect next states
Source: Richard Sutton, 2017

Comparison to Supervised/Unsupervised Learning
Supervised Learning
● Set of labeled examples provided by “external supervisor”
● Not applicable to learning from interaction
○ Generally complicated to obtain examples of all situations
Unsupervised Learning
● Usually tries to learn structure/data representation
● Does not exactly match RL: RL wants to maximize a reward

Reinforcement Learning Agent
● Policy: a function for the behavior, which maps
states to actions
● Value function: how good each state and/or action is
● Model: the agent’s representation of the environment

Markov Decision Process (MDP)
In general
● Mathematical framework for modelling decision
making
● States, actions, and rewards
Relationship with RL
● Formally describes an environment for RL in which
the state is fully observable
● Almost all RL problems can be formalized as
MDPs
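For reference, the usual textbook formalization looks roughly as follows. The notation (including the discount factor \gamma and the return G_t) follows common convention and is not taken from the slides.

```latex
% An MDP as a tuple of states, actions, transition probabilities,
% reward function and discount factor.
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  P(s' \mid s, a) = \Pr\left[ S_{t+1} = s' \mid S_t = s,\; A_t = a \right]
\]
% The agent tries to maximize the expected discounted return.
\[
  G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
\]
```

The transition probability depends only on the current state and action, which is the Markov property that gives the framework its name.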

RL simple example (1)
[Figure: grid-world environment with +1 and -1 terminal rewards, shown next to a possible policy]

RL simple example (2) - Q-learning
Source: https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
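A minimal sketch of tabular Q-learning on a small grid world, in the spirit of the demo linked above but not derived from its code. The 3x4 layout, the reward placement, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are all assumptions chosen for illustration.

```python
import random

ROWS, COLS = 3, 4
GOAL, PIT, WALL, START = (0, 3), (1, 3), (1, 1), (2, 0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Deterministic move; hitting the wall or the border keeps the agent in place."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt == WALL:
        nxt = state
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == PIT:
        return nxt, -1.0, True
    return nxt, 0.0, False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:  # greedy action, ties broken at random
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best next action.
            target = reward
            if not done:
                target += gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the learned greedy policy from START."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
print(greedy_path(q_table))
```

The table holds one value per (state, action) pair, which works here because the grid has only twelve cells.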

Examples of RL success in games (prior to DL)
● Backgammon: TD-Gammon (1992)
● Scrabble: Maven (2000s)

What about Atari games?
How to represent complex games in RL scenario?
Can we use Deep Learning to capture information from raw pixels?

Deep Reinforcement Learning

Deep Q-Learning (DQN)
● Q-Learning is a tabular method
○ What if it’s the first time we’re visiting a state?
● Can we use a neural network as our Q-function?
○ Yes!
○ However, RL is unstable or may even diverge when using a nonlinear
function approximator (e.g. a neural network)
● DQN has clever techniques to solve that!
Source: Human-level control through deep reinforcement learning, 2015

DQN - Overview
Source: Resource Management with Deep Reinforcement Learning, 2016

DQN - Breakout
Source: https://www.youtube.com/watch?v=TmPfTpjtdgg

DQN
● Single architecture can successfully learn control policies in a range of different
environments
● Combines deep network architectures with reinforcement learning
○ Experience replay
○ Target network: made algorithm more stable
● Limitations
○ Games that demand more temporally extended strategies remain a great
challenge
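The two stabilizing ideas named above, experience replay and a target network, can be sketched without any deep learning library. This is only an illustration of the bookkeeping, with network weights stood in by plain dictionaries; `ReplayBuffer`, `td_target` and `sync_target` are hypothetical names, and `gamma=0.99` is an assumed default.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest items fall out first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Random sampling breaks the correlation between consecutive frames.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target; next_q_values should come from the target network."""
    return reward if done else reward + gamma * max(next_q_values)


def sync_target(online_params, target_params):
    """Copy the online weights into the target network (done periodically)."""
    target_params.clear()
    target_params.update(copy.deepcopy(online_params))


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buffer))  # 3: only the most recent transitions are kept
```

Because the target network changes only when `sync_target` is called, the regression targets stay fixed between syncs instead of shifting with every gradient step.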

Go
Number of possible positions: ~10^170!
Branching factor: average is 250!

AlphaGo
[Figure: policy network and value network]

AlphaGo - Training Pipeline (simplified)
Source: Mastering the game of Go with deep neural networks and tree search, 2016

AlphaGo - Monte Carlo Tree Search (MCTS)
Source: Mastering the game of Go with deep neural networks and tree search, 2016

AlphaGo - Results
● Played against Lee Sedol, in
March 2016
● Lee has won 18 world titles
● AlphaGo won the match 4-1
AlphaGo Documentary (Netflix)

AlphaZero (as per David Silver’s NIPS talk)
No human data
● Learns purely through self-play reinforcement learning,
starting from random play
No human features
● Only takes raw board as input
Single neural network
● Policy and Value networks are combined
Simplified search
● No Monte Carlo rollouts; uses the neural network to
evaluate positions
Source: 2017 NIPS Keynote by DeepMind's David Silver

AlphaZero (as per David Silver’s NIPS talk)
Source: 2017 NIPS Keynote by DeepMind's David Silver

Dota 2
● Real-time strategy (RTS) game
○ More precisely, a subgenre called multiplayer
online battle arena (MOBA)
● Two teams of five players, where each player
controls a hero
● Main goal is to destroy the opposing team’s base
● Lots of challenges for RL

Dota 2 - Challenges for RL
● Long time horizons
○ 30 fps for 45 minutes
● Partially-observed state
○ Only part of the map is visible
○ Needs to make inferences with incomplete data
● High-dimensional, continuous action space
○ Space discretized into 170,000 possible actions
○ ~1,000 valid actions at any given moment
● High-dimensional, continuous observation space
○ State: 20,000 numbers
Source: https://blog.openai.com/openai-five/
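A quick back-of-the-envelope calculation shows what "30 fps for 45 minutes" means for the time horizon. The frame-skip value of 4 is taken from the cited OpenAI post rather than from the slide, so treat it as an assumption here.

```python
FPS = 30          # game ticks per second (from the slide)
MINUTES = 45      # average game length (from the slide)
FRAME_SKIP = 4    # assumption: the agent acts on every fourth frame

ticks_per_game = FPS * 60 * MINUTES
decisions_per_game = ticks_per_game // FRAME_SKIP

print(ticks_per_game)      # 81000
print(decisions_per_game)  # 20250
```

That is on the order of twenty thousand decisions per game, far more than the number of moves in a typical game of chess or Go.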

Dota 2 - OpenAI Five
● Each hero is controlled by a 1024-unit
LSTM
● Extracts game state with Valve’s Bot API
● Learns entirely from self-play
● Uses Proximal Policy Optimization
(PPO) for training
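The core of PPO is its clipped surrogate objective, sketched below in plain Python. This is a generic illustration and not OpenAI Five's training code; the function name and the `clip_eps=0.2` default are assumptions for this example.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: estimated advantage of each sampled action
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the policy
        # far from the one that collected the data.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# A good action whose probability already rose by 50%: the gain is capped.
print(ppo_clip_objective([1.5], [1.0]))
# A bad action whose probability rose by 50%: the full penalty is kept.
print(ppo_clip_objective([1.5], [-1.0]))
```

The asymmetry is the point: improvements are capped once the policy has moved far enough, while mistakes are always penalized in full.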

Dota 2 - OpenAI Five
● Simplified version of the game (not all heroes, removed some tactics)
● Played against a team of 99.95th-percentile Dota players
○ Four have played professionally
● 3 games
○ OpenAI Five won 1st and 2nd
○ 3rd: audience was asked to choose the heroes
■ The AI predicted a 2.9% chance of winning

OpenAI Five -> Dexterity
● Robot hand that can manipulate physical objects
● Uses the same RL algorithm as OpenAI Five

Other examples
● StarCraft II
● Battlefield

Take home message
● Reinforcement Learning is a hot topic
● The combination of RL and Deep Learning is producing great results
● Games are a great proxy for developing solutions for real-world problems
○ Lots of challenges far from being solved
● What about an RL agent that plays against you and adapts to counter your
style of play?

Thank you!
August 16, 2018 - #10 Porto Alegre Machine Learning Meetup
Thomas Paula
● @tsp_thomas
● tsp.thomas@gmail.com
