OpenAI recently published a fun paper showing that evolution strategies can train policy networks to perform on par with state-of-the-art deep reinforcement learning. In this talk we’ll reimplement the main ideas of that paper using Neanderthal (blazing-fast matrix and linear algebra computations) and Cortex (neural networks); make it massively distributed using Onyx; build a simulation environment using re-frame; and, of course, save our princess from no particular harm in our toy game example.
2. We will build an AI to play a silly little game by training a policy network defined using Cortex, with a hot new training algorithm from the paper that we will implement first using Neanderthal and then make massively parallel using Onyx.
3. The game
• Find the shortest path to the princess
• Moves: up, down, left, right
• Don’t fall off the edge of the world
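The rules above fit a tiny grid world. A minimal sketch of such an environment, assuming a square grid with the princess on a fixed cell, a penalty for stepping off the edge, and a per-step cost that rewards short paths (all names and reward values here are hypothetical — the talk builds its actual environment with re-frame):

```python
SIZE = 5
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(pos, move, princess=(4, 4)):
    """Apply a move; return (new-pos, reward, done)."""
    dx, dy = MOVES[move]
    x, y = pos[0] + dx, pos[1] + dy
    if not (0 <= x < SIZE and 0 <= y < SIZE):
        return (x, y), -10.0, True   # fell off the edge of the world
    if (x, y) == princess:
        return (x, y), 10.0, True    # found the princess
    return (x, y), -1.0, False       # step cost encourages short paths
```

The per-step cost of -1 is what makes the *shortest* path the best policy rather than any path.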
6. Reinforcement learning
• Interact with the environment [embodied cognition]
• Not a single solution but an action to take given the environment
[model of the world + model of self, consciousness?]
• Learns via positive/negative feedback
7. Reinforcement learning: how it’s usually done
Train a deep neural network using raw sensor
data, usually pixels (i.e. no feature engineering)
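A policy network in this setting maps raw observations to a probability distribution over actions. A minimal sketch — a single linear layer plus softmax rather than a real deep net, with all names hypothetical:

```python
import math

def softmax(zs):
    """Turn raw action scores into a probability distribution."""
    m = max(zs)                              # subtract max for numeric stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, state):
    """One linear layer: state features -> scores for the 4 moves."""
    scores = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(scores)
```

A real pixel-based agent would stack convolutional layers before the softmax, but the interface — observation in, action distribution out — is the same.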
11. Using ES to train a neural network
Benefits
• highly parallelizable
• more robust (fewer hyperparameters, more
stable, doesn’t care about the properties of
the reward function)
• can exploit structure
• less computationally expensive
Downsides
• takes longer to converge
• noise must lead to different outcomes
Instead of backpropagation, use ES directly on the weights
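The update in the OpenAI paper is simple: perturb the weight vector with Gaussian noise, score each perturbation with a black-box fitness function, and move the weights toward the reward-weighted average of the noise. A minimal sketch in Python (the talk implements this with Neanderthal; function names and hyperparameter values here are illustrative):

```python
import random

def es_step(theta, fitness, n=50, sigma=0.1, alpha=0.02):
    """One evolution-strategies update: no gradients, no backpropagation."""
    # sample n Gaussian perturbations of the weight vector
    eps = [[random.gauss(0, 1) for _ in theta] for _ in range(n)]
    # score each perturbed candidate with the black-box fitness function
    rewards = [fitness([t + sigma * e for t, e in zip(theta, ei)]) for ei in eps]
    # normalize rewards so the step size is invariant to reward scale
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # estimated ascent direction: reward-weighted average of the noise
    grad = [sum(a, 0.0) for a in
            ([adv * ei[j] for adv, ei in zip(advantages, eps)] for j in range(len(theta)))]
    return [t + alpha * g / (n * sigma) for t, g in zip(theta, grad)]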
15. Neanderthal
• Blazing fast matrix and linear algebra library
• Based on ATLAS and LAPACK
• Runs on CPUs and GPUs
• A study in writing efficient code
• Somewhat terse API (fluokitten helps)
34. Resilience and handling
state
• Activity log
• Window and trigger states checkpointed
• Resume points (transfer state from job to job)
• Configurable flux policies (continue/kill/recover)
37. Cortex
• Neural networks, regression and feature learning
• Clean idiomatic Clojure API
• Computation encoded as data (and makes good use of it)
• Uses core.matrix for heavy lifting