RL is one of the methods that might come to the rescue. The basic idea of RL is very simple. We have an agent that observes the state of the environment, takes an action, and receives a reward. The environment receives the action, updates its internal state, and the cycle repeats.
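As a minimal, self-contained sketch of this interaction loop (the toy environment and random agent below are purely illustrative, not tied to any particular library):

```python
# A minimal sketch of the RL interaction loop; the environment and agent here
# are toy stand-ins, not part of any specific framework.
import random

class ToyEnv:
    """Trivial environment: the state is a step counter, the episode ends after 10 steps."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1                       # the environment updates its internal state
        reward = 1.0 if action == 1 else 0.0  # ...and emits a reward
        done = self.state >= 10
        return self.state, reward, done

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])            # the agent observes the state and acts
    state, reward, done = env.step(action)    # the environment reacts to the action
    total_reward += reward                    # the agent receives a reward
print("episode return:", total_reward)
```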
With virtual environments, we can potentially get an unlimited amount of data to train our models.
Since then, deep reinforcement learning has made substantial progress in many kinds of games, including Atari games, Go, Dota 2, etc.
Then the question is, what is the next step?
In this talk, I am going to discuss a few recent works that explore the power of planning in the reinforcement learning setting.
I will start with one example of why planning is important.
We all know that last year, DeepMind released a paper in Nature about AlphaGoZero, which learns a superhuman Go bot without any human knowledge.
The idea is very simple: starting from a randomly initialized model, we use a planning algorithm like MCTS to find the best move at each step, and save the self-play games into a replay buffer. Then we update the model based on the replay buffer, and repeat.
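Roughly, the loop has the following structure (a sketch only: the components below are trivial stand-ins so the loop runs, not the real MCTS, game, or network):

```python
# Structural sketch of the AGZ-style loop: self-play with search -> replay buffer ->
# model update -> repeat. All components are simplified stand-ins.
import random

def mcts_move(model, state):
    # Stand-in for MCTS: the real system runs a tree search guided by the model
    # and returns an improved move/policy; here we just pick a move at random.
    return random.choice([0, 1])

def play_selfplay_game(model):
    # Stand-in self-play: record (state, move) pairs plus the final game outcome.
    trajectory = [(s, mcts_move(model, s)) for s in range(5)]
    outcome = random.choice([-1, 1])
    return [(s, a, outcome) for s, a in trajectory]

def update_model(model, batch):
    # Stand-in for gradient training on (state, search policy, outcome) targets.
    return model

model, replay_buffer = {"weights": 0}, []
for iteration in range(3):
    for _ in range(10):                          # self-play phase
        replay_buffer.extend(play_selfplay_game(model))
    batch = random.sample(replay_buffer, 16)     # sample from the replay buffer
    model = update_model(model, batch)           # training phase, then repeat
```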
This approach is surprisingly simple, yet it gives very strong performance. After 3 days of training with thousands of TPUs, the model can already beat AlphaGo Lee; after 40 days it defeats AlphaGo Master.
Inspired by these interesting and exciting results, we tried to reproduce AlphaGoZero with our recently published ELF platform. The goal is to understand why such a simple approach yields such strong performance.
ELF is an Extensive, Lightweight and Flexible framework that makes building a practical RL system easy and manageable.
To achieve this, ELF puts all the implementation and design details on the C++ side, leaving the Python side with a simple for-loop, as advertised in the textbook.
Moreover, at each step the Python side receives a batch of game states for the neural network model to operate on, which improves efficiency.
The key idea of ELF is dynamic batching across multiple game instances. In this platform, many games run simultaneously. From time to time, each has a request to call a deep learning API (like PyTorch) to compute the next states and actions, or to store the game history into replay buffers. ELF provides a dynamic batching interface so that these requests are automatically batched for high throughput.
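To give a flavor of what such batching looks like from Python (a toy illustration only; the "games" and the small PyTorch policy network below are stand-ins, not ELF's actual API, and in ELF the queueing and batching happen on the C++ side):

```python
# Toy illustration of batching requests from many game instances into one
# neural-network forward pass.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# Pretend several concurrently running games each produced a state and are
# waiting for an action.
pending_states = [torch.randn(4) for _ in range(16)]   # one state per game instance

batch = torch.stack(pending_states)                     # dynamic batch: shape [16, 4]
with torch.no_grad():
    actions = policy(batch).argmax(dim=1)               # one forward pass serves all games

# Each action is routed back to the game instance that requested it.
for game_id, a in enumerate(actions.tolist()):
    print(f"game {game_id}: action {a}")
```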
We improved our framework to support the distributed setting, putting AGZ and AZ training together.
We then released ELF OpenGo, which reproduces AZ and AGZ.
One question is: OK, so what can we learn from it?
One interesting observation from this experiment is that knowledge propagates backwards. You first see meaningful moves near the end of the game, where the reward signal is. Over iterations, meaningful moves propagate all the way back to the opening of the game.
The right figure shows that even with well-trained models, additional planning still makes the bot much stronger. In fact, the strength of the bot is always constrained by the capacity of the model.
We are working on an arXiv paper that discusses the training details.
It is not only AGZ/AZ training that requires planning. For general reinforcement learning, planning during training is also very important. This includes not only 1-step lookahead, but also T-step lookahead as well as more complicated planning mechanisms like tree search. A large portion of deep RL tries to push the results given by planning into neural networks.
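As one concrete flavor of this, here is a minimal sketch of computing T-step lookahead (n-step return) targets that a value network would then be trained to fit; the numbers are made up purely for illustration:

```python
# T-step lookahead targets (n-step returns) for a value network; illustrative numbers only.
gamma, T = 0.99, 3
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]                 # rewards observed along a trajectory
bootstrap_values = [0.5, 0.4, 0.6, 0.2, 0.1, 0.0]   # V(s_t) estimates, one per state plus terminal

targets = []
for t in range(len(rewards)):
    horizon = min(t + T, len(rewards))
    # Discounted rewards for up to T steps ahead, then bootstrap with the value estimate.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    g += gamma ** (horizon - t) * bootstrap_values[horizon]
    targets.append(g)

print(targets)   # these T-step targets would serve as regression targets for the value network
```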
Planning is also very important for general game AI; chess and other games are important examples.
Traditional