HIERARCHICAL DECISION MAKING
USING SPATIO-TEMPORAL ABSTRACTION
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
by S.K.Ramnandan
Deep Learning Group, IIT-Madras
How to emulate intelligent behaviour?
Spatial abstraction - by ignoring irrelevant sensory input
Group sets of primitive states in MDP into abstract states
Temporal abstraction - by ignoring fine-grained details of actions
Extended actions directly take agent from one abstract state to another
Identify useful skills
Motivation for spatial abstraction:
Find regions of state space that are well-connected - abstract states
Idea from conformation dynamics - metastability:
Particles stay in the same region of state space for long periods of time without external stimulus
Behaviour under random walks
Identified using a spectral clustering algorithm - PCCA+
PCCA+
Construct the Laplacian of the transition matrix corresponding to a random walk on the underlying MDP
The spectrum of the Laplacian encodes the properties of the underlying graph
Vertices of a simplex which lie in the transformed basis are the abstract states
States are assigned to abstract states based on their membership to clusters after projection
Advantages:
Degree of membership of states to each abstract state
Connectivity information between abstract states
Automatically estimates the number of abstract states
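A minimal sketch of the spectral step and the simplex-based soft clustering, in the spirit of PCCA+. This is an illustrative simplification (a farthest-point heuristic for the simplex vertices), not Weber's full PCCA+ optimization; the function and variable names are ours:

```python
import numpy as np

def metastable_memberships(T, k):
    """Soft-partition the states of a random walk with row-stochastic
    transition matrix T into k metastable abstract states.
    Illustrative simplification of PCCA+."""
    # Eigenvectors for the k largest eigenvalues of T span the slowly
    # mixing subspace (equivalently, the smallest eigenvalues of the
    # Laplacian L = I - T).
    w, V = np.linalg.eig(T)
    X = V[:, np.argsort(-w.real)[:k]].real       # (n_states, k) spectral coords

    # Farthest-point heuristic: pick k well-separated states to act as
    # the vertices of the simplex in spectral space.
    verts = [int(np.argmax(np.linalg.norm(X - X.mean(0), axis=1)))]
    while len(verts) < k:
        d = np.linalg.norm(X[:, None] - X[verts], axis=2).min(axis=1)
        verts.append(int(np.argmax(d)))

    # Memberships: write each state's spectral coordinates as an affine
    # combination of the vertex coordinates, then clip and renormalize.
    A = np.vstack([X[verts].T, np.ones(k)])
    B = np.vstack([X.T, np.ones(X.shape[0])])
    chi, *_ = np.linalg.lstsq(A, B, rcond=None)
    chi = np.clip(chi.T, 0.0, None)
    return chi / chi.sum(axis=1, keepdims=True)  # (n_states, k) memberships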
TEMPORAL ABSTRACTION: OPTIONS
Use the partition of the state space into abstract states, along with the membership function returned by PCCA+, to compose options for free (see the sketch below)
Thus, the structural information obtained is used to define behavioural policies for the subtasks, independent of the task being solved
Hence these skills may work even for platform games where rewards are hugely delayed
[Figure: option policy to go from abstract state 1 to abstract state 2 in the 3-room domain]
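To make "options for free" concrete, a hypothetical sketch of an option policy that drives the agent toward a target abstract state j by greedily climbing the PCCA+ membership. Here `chi` is the membership matrix from the sketch above, and `T_a` (a per-action transition model) is an assumption for illustration, not something the slides provide:

```python
import numpy as np

def option_action(s, j, chi, T_a):
    """Pick the action whose expected next-state membership in the
    target abstract state j is highest (hypothetical sketch)."""
    # T_a[a] is an (n_states, n_states) transition matrix for action a;
    # T_a[a][s] @ chi[:, j] is E[chi_j(s')] after taking a in state s.
    return max(range(len(T_a)), key=lambda a: float(T_a[a][s] @ chi[:, j]))
```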
ONLINE AGENT FOR PLATFORM GAMES
No access to a model of the MDP
Have to estimate the transition matrix from sampled trajectories (sketched below)
The underlying policy while sampling cannot be random, since exploration of the MDP depends heavily on a near-optimal policy
Pipeline: Trajectories → Featurization → Dimensionality Reduction → Clustering → Fitting Markov State Model → PCCA+
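Since no model is available, the transition matrix is estimated empirically. A minimal sketch, assuming trajectories are sequences of already-discretized state indices:

```python
import numpy as np

def estimate_transition_matrix(trajectories, n_states):
    """Empirical row-stochastic transition matrix from sampled
    trajectories (each a sequence of integer state indices)."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1.0
    counts += 1e-9                    # keep unvisited rows valid
    return counts / counts.sum(axis=1, keepdims=True)
```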
MARIO DOMAIN
12 possible primitive actions
Rewards for achieving ‘side’ goals, such as gathering coins and killing monsters
FEATURIZATION
22 x 16 tiled grid with 25 possible values per tile
Exponential state space - 25^352 possible states
Higher-level state representation than pixel space
DIMENSIONALITY REDUCTION
After featurization, dimensionality of state vector = 240
Curse of dimensionality, local feature relevance problem
For 10,000 trajectories, time taken to cluster & fit a Markov state model (MSM):

State dimension    1-D       3-D        240-D
Time taken         15 min    307 min    ?

Reduced-dimension representation learning:
Deep Q-Network
Autoencoder (denoising)
Stacked denoising autoencoder
DQN
RL presents challenges from a deep learning perspective
No direct association between inputs and targets - RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed
Correlated data - in RL, one encounters sequences of highly correlated data
Non-stationary training distribution - problematic for deep learning methods that assume a fixed underlying distribution
A neural network trained on the TD error acts as a non-linear function approximator for action-values
Experience replay mechanism - randomly samples previous transitions (s, a, r, s') from a replay pool
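A minimal sketch of the replay mechanism; uniform sampling from the pool is what breaks the temporal correlation in the training data (the capacity is our assumption, the slides give no value):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay pool of (s, a, r, s') transitions."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # oldest transitions drop out

    def add(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive transitions.
        return random.sample(self.pool, batch_size)
```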
[DQN architecture: 84x84x4 input → 16 8x8 filters → 32 4x4 filters → fully connected hidden layer → fully connected output layer]
• Deriving an approximate state representation
• Compress the last hidden layer to simulate the encoder of an autoencoder
• Summarize the state by the values of the neurons in the last hidden layer (sketched below)
• In the case of Mario, where the input is not in pixel space, convolutional layers are replaced with fully connected layers
Note: contractive nature of the reduced-dimension representation as the number of training epochs increases
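A hypothetical PyTorch sketch of the Mario variant: fully connected layers replace the convolutions, and the trained network body doubles as an encoder whose last-hidden-layer activations summarize the state. The 240-d input and 12 actions come from the slides; the layer widths are our assumptions:

```python
import torch
import torch.nn as nn

class MarioDQN(nn.Module):
    def __init__(self, state_dim=240, hidden_dim=32, n_actions=12):
        super().__init__()
        self.body = nn.Sequential(                  # plays the role of the encoder
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, hidden_dim), nn.ReLU(),  # last hidden layer
        )
        self.head = nn.Linear(hidden_dim, n_actions)  # Q-values

    def forward(self, x):
        return self.head(self.body(x))

    def encode(self, x):
        # Approximate state = activations of the last hidden layer.
        with torch.no_grad():
            return self.body(x)
```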
AUTOENCODER
Cross-entropy reconstruction error for binary inputs
Can the same loss be used directly for ordinal-valued inputs?
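For reference, the standard cross-entropy reconstruction error for a binary input x and reconstruction x̂ over d dimensions (a textbook formula, not specific to this work):

```latex
L(x, \hat{x}) = -\sum_{k=1}^{d} \big[\, x_k \log \hat{x}_k + (1 - x_k) \log (1 - \hat{x}_k) \,\big]
```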
AUTOENCODER (DENOISING)
• Is the representation learnt by the autoencoder useful enough?
• Further constraints need to be applied to try to separate useful information from noise
• This will naturally translate to a non-zero reconstruction error
• Two implicit underlying ideas:
• A higher-level representation should be rather stable and robust under corruptions of the input
• Performing the denoising task well requires extracting features that capture useful structure in the input distribution
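A minimal PyTorch sketch of the denoising autoencoder, using masking noise at the 25% level reported in the experiments (the layer widths are our assumptions; the loss is computed against the clean input):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim=240, latent_dim=3, noise=0.25):
        super().__init__()
        self.noise = noise
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim), nn.Sigmoid())

    def forward(self, x):
        # Corrupt the input with masking noise, then reconstruct; train
        # against the CLEAN x, e.g. nn.BCELoss()(self(x), x).
        mask = (torch.rand_like(x) > self.noise).float()
        return self.decoder(self.encoder(x * mask))
```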
VISUALIZATION OF REDUCED DIMENSION
[Panels: DQN 1-d, DQN 2-d, DQN 3-d; Auto 1-d, Auto 2-d, Auto 3-d]
VISUALIZATION OF REDUCED DIMENSION
[Panels: Auto 1-d, Auto 2-d, Auto 3-d at 0% noise; dAuto 1-d, dAuto 2-d, dAuto 3-d at 25% noise]
RECONSTRUCTION ERROR

Reduced dimension    Auto       dAuto (25% noise)
h-1                  200.559    177.456
h-2                  168.765    158.984
h-3                  158.751    151.514
h-5                  156.246    139.845

[Plot: training cost curves for Auto vs dAuto - the fall in training cost is smoother for the denoising autoencoder]
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
• Average % increase in return per episode: 15.3%
• Average % decrease in time spent per episode: 4.39%
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
Observations:
Performance improves when approximating the state using the denoising variant of the autoencoder, for the same latent representation size
Tradeoff when increasing the dimensionality of the approximated state:
Increase in end-to-end performance
Significant increase in time taken for clustering & fitting a Markov state model
MONTEZUMA’S REVENGE
• Much higher emphasis on representation learning than in Mario
• DeepMind’s DQN reports its worst performance on this game - 0% relative to a human test player
• After training the DQN, we have a 256-dimensional real-valued feature vector output by the last fully connected hidden layer
• It has been observed that the magnitudes of the output values themselves do not matter in an image recognition task
• Hence we can binarize the values and obtain a 256-bit binary feature vector representing a state (sketched below)
• Perform further state approximation using the denoising autoencoder
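A minimal sketch of the binarization step (the threshold is our assumption; the slides do not specify one):

```python
import numpy as np

def binarize_state(h, threshold=0.0):
    """Turn the 256 real-valued last-hidden-layer activations into a
    packed 256-bit binary signature of the state."""
    bits = (h > threshold).astype(np.uint8)   # one bit per neuron
    return np.packbits(bits)                  # 32 bytes = 256 bits
```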
