Behavior Regularized Actor Critic (BRAC)
: Introduces a general framework to empirically evaluate recently proposed methods for offline continuous-control tasks (see the objective sketch below).
• It finds that many of the technical complexities of those methods are unnecessary to achieve strong performance.
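A minimal sketch of the behavior-regularized objective (value-penalty variant), assuming PyTorch callables `actor(s) -> a`, `critic(s, a) -> Q`, `target_critic(s, a) -> Q`, and `divergence(s)` estimating D(pi(.|s), pi_b(.|s)) against the behavior policy; these names and the fixed weight `alpha` are illustrative, not the paper's exact code.

```python
import torch


def brac_losses(batch, actor, critic, target_critic, divergence,
                alpha=0.1, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = actor(s_next)                       # a' ~ pi(.|s')
        penalty = divergence(s_next)                 # D(pi(.|s'), pi_b(.|s'))
        target_q = r + gamma * (1.0 - done) * (
            target_critic(s_next, a_next) - alpha * penalty)
    # Critic regresses onto the divergence-penalized bootstrap target.
    critic_loss = ((critic(s, a) - target_q) ** 2).mean()
    # Actor maximizes Q while staying close to the behavior policy.
    actor_loss = -(critic(s, actor(s)) - alpha * divergence(s)).mean()
    return critic_loss, actor_loss
```

In the policy-regularization variant the penalty is dropped from the critic target and kept only in the actor loss; either way the single knob is the divergence weight `alpha`.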
In the offline setting, DDPG fails to learn a good policy even when the dataset is collected by a single behavior policy, with or without noise added to that policy.
⇒ These failure cases are hypothesized to be caused by erroneous generalization of the state-action
value function (Q-value function) learned with function approximators.
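To make the hypothesis concrete (an illustrative restatement, not a formula quoted from the paper), the DDPG critic regresses onto a bootstrapped target that evaluates the learned Q-function at actions proposed by the current policy,
\[
y(s, a) \;=\; r(s, a) + \gamma\, Q_{\theta'}\!\big(s',\, \pi_\phi(s')\big),
\]
so when \(\pi_\phi(s')\) falls outside the support of the dataset, \(Q_{\theta'}\) must extrapolate, and those extrapolation errors are propagated and amplified through repeated bootstrapping.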
Approaches
1. Random Ensemble Mixture (REM): Apply a random ensemble of Q-value targets to stabilize
the learned Q-function (see the sketch after this list).
2. BCQ and BEAR: Regularize the learned policy towards the behavior policy.
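A minimal sketch of a REM-style target for continuous control, assuming an ensemble head `target_q_heads(s, a)` that returns a `[batch, K]` tensor of Q-estimates; the function and argument names are illustrative. The target uses a random convex combination of the heads, resampled at every update.

```python
import torch


def rem_target(r, s_next, done, actor, target_q_heads, gamma=0.99):
    with torch.no_grad():
        a_next = actor(s_next)                       # a' ~ pi(.|s')
        q_next = target_q_heads(s_next, a_next)      # [batch, K] ensemble estimates
        alphas = torch.rand(q_next.shape[-1])        # random convex weights
        alphas = alphas / alphas.sum()
        mixed_q = (q_next * alphas).sum(dim=-1)      # random ensemble mixture
        return r + gamma * (1.0 - done) * mixed_q
```

Approach 2 (BCQ and BEAR) corresponds to the divergence penalty in the earlier BRAC sketch: the learned policy is constrained to stay close to the behavior policy, e.g. via a sampled KL or MMD term.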