We propose a novel value function approximation technique for Markov decision processes that compactly represents the state-action value function using a low-rank plus sparse matrix model. Under minimal assumptions, this decomposition is a Robust Principal Component Analysis problem that can be solved exactly via Principal Component Pursuit, a convex optimization problem.
3. Value function approximation
a Markov decision process can be solved optimally given the state-action value function
– the value function gives the utility of taking an action in a given state; we want the action that maximizes this utility
– can be represented as a matrix for discrete problems
– typically millions or billions of dimensions for practical problems
value function approximation finds a compact alternative
– basis functions are used widely in reinforcement learning (RL)
– e.g., Gaussian radial basis functions, neural networks (see the sketch below)
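To make the basis-function idea concrete, here is a minimal sketch (not from the original slides) that approximates Q(s, a) as a weighted sum of Gaussian radial basis functions; the centers, bandwidth, and weights are hypothetical placeholders.

```python
# Minimal sketch of basis-function value approximation (illustrative only):
# approximate Q(s, a) as a weighted sum of Gaussian radial basis functions.
import numpy as np

def rbf_features(s, centers, bandwidth=1.0):
    """Gaussian RBF feature vector phi(s) for a state s."""
    d = np.linalg.norm(np.atleast_2d(centers) - s, axis=1)
    return np.exp(-(d ** 2) / (2 * bandwidth ** 2))

def q_approx(s, a, weights, centers):
    """Q_hat(s, a) = phi(s) . w_a, one weight vector per discrete action."""
    return rbf_features(s, centers) @ weights[a]

# e.g., 5 RBF centers on [0, 1] and 3 actions with random (untrained) weights
centers = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
weights = np.random.randn(3, 5)
print(q_approx(np.array([0.3]), a=1, weights=weights, centers=centers))
```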
4. Value function decomposition
idea: approximate the value function as low-rank plus sparse components
assumes intrinsic low-dimensionality
– i.e., the value function can be captured by a small set of features
– hinted at by the success of basis function approximation in RL
falls under the category of Robust Principal Component Analysis (PCA)
– widely used in image/video analysis and collaborative filtering; e.g., the Netflix challenge
– a novel application of Robust PCA, as far as the author is aware
6. Markov decision process
defined by the tuple (S, A, T, R)
S and A are the sets of all possible states and actions, respectively
T gives the probability of transitioning into state s′ when taking action a in the current state s, and is often denoted T(s, a, s′)
R gives a scalar value indicating the immediate reward received for taking action a in the current state s, and is denoted R(s, a) (see the sketch below)
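One minimal tabular encoding of this tuple, assuming numpy arrays; the sizes and random values below are assumptions for demonstration only.

```python
# A minimal tabular encoding of a finite MDP (illustrative; shapes are assumptions).
import numpy as np

n_states, n_actions = 4, 2

# T[s, a, s'] = probability of landing in s' after taking action a in state s
T = np.random.rand(n_states, n_actions, n_states)
T /= T.sum(axis=2, keepdims=True)  # normalize so each T[s, a, :] is a distribution

# R[s, a] = immediate reward for taking action a in state s
R = np.random.randn(n_states, n_actions)
```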
7. Value iteration
want to find the optimal policy π(s)
returns the action that maximizes utility from any given state
related to the state-action value function Q(s, a):
$\pi(s) = \arg\max_{a \in A} Q(s, a)$
value iteration updates a value function guess $\hat{Q}$ until convergence (sketched below):
$\hat{Q}(s, a) := R(s, a) + \sum_{s' \in S} T(s, a, s') \max_{a' \in A} \hat{Q}(s', a')$
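A minimal sketch of this update over the tabular T and R arrays from the previous sketch; the discount factor gamma is an assumption not shown on the slide, added because it is the standard way to guarantee convergence of the iteration.

```python
# Sketch of tabular value iteration (assumes T, R as in the MDP sketch above).
# gamma is an assumption: a discount factor < 1 makes the update a contraction.
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-8, max_iters=10_000):
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        # Q_hat(s, a) := R(s, a) + gamma * sum_{s'} T(s, a, s') max_{a'} Q_hat(s', a')
        Q_new = R + gamma * (T @ Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q

Q = value_iteration(T, R)
policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)
```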
8. Matrix decomposition
suppose the matrix $M \in \mathbb{R}^{m \times n}$ encodes Q(s, a)
– m and n are the cardinalities of the state and action spaces
approximate with the decomposition $M = L_0 + S_0$
– $L_0$ and $S_0$ are the true low-rank and sparse components
why should this work?
– implicit assumption that utility values are correlated across actions (see the sketch below)
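For illustration, a value matrix with exactly this structure can be synthesized as follows; the sizes, rank, and corruption level are arbitrary assumptions for the demo.

```python
# Construct a synthetic Q matrix that is exactly low-rank plus sparse.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 20, 3  # states, actions, intrinsic rank (all assumed)

L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # low-rank part
S0 = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05  # ~5% of entries get sparse "corruptions"
S0[mask] = 10 * rng.standard_normal(mask.sum())

M = L0 + S0  # the observed value matrix encoding Q(s, a)
```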
11. Principal Component Pursuit (PCP)
best (known) convex estimate of Robust PCA
minimize $\|L\|_* + \lambda \|S\|_1$
subject to $L + S = M$
intuitively
– the nuclear norm $\|\cdot\|_*$ is the best convex approximation to minimizing rank
– the $\ell_1$-norm has a sparsifying property
remarkably, under mild conditions the solution to PCP recovers the true components of M exactly [CLMW11] (see the solver sketch below)
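A hedged sketch of a PCP solver using the alternating-directions method described in [CLMW11]; the parameter choices λ = 1/√max(m, n) and μ = mn/(4‖M‖₁) follow the paper's recommendations, but this is an illustrative implementation, not tuned code.

```python
# Sketch of Principal Component Pursuit via the alternating-directions
# augmented Lagrangian method of Candes et al. [CLMW11].
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: the proximal operator of the l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(sig, tau)) @ Vt

def pcp(M, max_iters=500, tol=1e-7):
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))      # lambda recommended in [CLMW11]
    mu = m * n / (4.0 * np.abs(M).sum())  # mu recommended in [CLMW11]
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(max_iters):
        L = svt(M - S + Y / mu, 1.0 / mu)       # nuclear-norm step
        S = shrink(M - L + Y / mu, lam / mu)    # l1 step
        residual = M - L - S
        Y += mu * residual                      # dual update
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

L, S = pcp(M)  # with M from the previous sketch, L and S should match L0 and S0
```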
16. References
[CLMW11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.