
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017



Deep Reinforcement Learning with Shallow Trees:
In this talk, I present Concept Network Reinforcement Learning (CNRL), developed at Bonsai. It is an industrially applicable approach to solving complex tasks with reinforcement learning that facilitates problem decomposition, allows component reuse, and simplifies reward functions. Inspired by Sutton's options framework, we introduce the notion of "Concept Networks": tree-like structures whose leaves are "sub-concepts" (sub-tasks), representing policies on a subset of the state space. The parent (non-leaf) nodes are "Selectors", containing policies that choose, at each time step of an episode, which sub-concept to execute among the child nodes. The talk begins with a high-level overview of reinforcement learning fundamentals.

Bio: Matineh Shaker is an Artificial Intelligence Scientist at Bonsai in Berkeley, CA, where she builds machine learning, reinforcement learning, and deep learning tools and algorithms for general-purpose intelligent systems. She was previously a Machine Learning Researcher at Geometric Intelligence, a Data Science Fellow at Insight Data Science, and a Predoctoral Fellow at Harvard Medical School. She received her PhD from Northeastern University with a dissertation on geometry-inspired manifold learning.



  1. Deep Reinforcement Learning with Shallow Trees. Matineh Shaker, AI Scientist (Bonsai). MLConf San Francisco, 10 November 2017
  2. Outline ● Introduction to RL (Reinforcement Learning) ● Markov decision processes ● Value-based methods ● Concept Network Reinforcement Learning (CNRL) ● Use cases
  3. A Reinforcement Learning Example: Rocket Trajectory Optimization with OpenAI Gym's LunarLander Simulator
  4. A Reinforcement Learning Example
     State: x position, y position, x velocity, y velocity, angle, angular velocity, left leg contact, right leg contact.
     Action (discrete): do nothing (0), fire left engine (1), fire main engine (2), fire right engine (3).
     Action (continuous): main engine power, left/right engine power.
     Reward: moving from the top of the screen to the landing pad with zero speed earns about 100-140 points; the episode finishes if the lander crashes or comes to rest, with an additional -100 or +100; each leg's ground contact is +10; firing the main engine costs -0.3 points per frame.
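The interface described on this slide can be sketched in plain Python. This is a hypothetical stand-in for illustration, not the actual OpenAI Gym implementation; the names `State`, `DISCRETE_ACTIONS`, and `step_reward` are made up here.

```python
from collections import namedtuple

# The 8-dimensional observation listed on the slide.
State = namedtuple("State", [
    "x_position", "y_position", "x_velocity", "y_velocity",
    "angle", "angular_velocity", "left_leg", "right_leg",
])

# Discrete action space: index -> meaning.
DISCRETE_ACTIONS = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}

def step_reward(new_leg_contacts, fired_main_engine, terminal=None):
    """Per-step reward pieces from the slide: +10 per new leg contact,
    -0.3 per frame the main engine fires, and a terminal bonus of
    +100 (comes to rest) or -100 (crashes)."""
    r = 10.0 * new_leg_contacts
    if fired_main_engine:
        r -= 0.3
    if terminal == "rest":
        r += 100.0
    elif terminal == "crash":
        r -= 100.0
    return r
```

For example, a landing frame where both legs touch down and the episode ends at rest contributes 10 + 10 + 100 = 120 points.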
  5. Basic RL Concepts
     Reward hypothesis: goals can be described by maximizing the expected cumulative reward.
     Sequential decision making: actions may have long-term consequences, and rewards may be delayed, like a financial investment. Sometimes the agent sacrifices instant reward to maximize long-term reward (just like life!).
     State data: sequential and non-i.i.d.; the agent's actions affect the next data samples.
  6. Definitions
     Policy: dictates the agent's behavior, mapping states to actions. Deterministic policy: a = π(s). Stochastic policy: π(a|s) = P(A_t = a | S_t = s).
     Value function: determines how good each state (and action) is: V^π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ], and similarly Q^π(s, a).
     Model: predicts what the environment will do next (a simulator's job, for instance).
  7. Agent and Environment
     At each time step, the agent receives an observation, receives a reward, and takes an action. The environment receives the action, then sends the next observation and the next reward.
  8. Markov Decision Processes (MDP): a mathematical framework for sequential decision making, modeling an environment in which all states are Markovian. A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩: states, actions, transition probabilities, rewards, and a discount factor. (Pictures from David Silver's slides.)
  9. Exploration vs. Exploitation
     ● Reinforcement learning (especially model-free) is like trial-and-error learning.
     ● The agent should find a good policy that maximizes future rewards from its experience of the environment, in a potentially very large state space.
     ● Exploration finds more information about the environment, while exploitation uses known information to maximize reward.
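A standard way to trade off the two is an ε-greedy rule: explore with probability ε, otherwise exploit the current value estimates. The slide does not name a specific strategy, so this is a minimal illustrative sketch.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (uniform random action);
    otherwise exploit: pick the action with the highest Q estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice ε is often annealed from near 1 (mostly exploring) toward a small value (mostly exploiting) as training progresses.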
  10. Value-Based Methods: Q-Learning
     Use the Bellman equation as an iterative update to find the optimal policy: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ].
     Problems: the iterative update is not scalable, and computing Q(s, a) for every state-action pair is infeasible most of the time.
     Solution: use a function approximator, such as a (differentiable) neural network, to estimate Q(s, a).
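The iterative update can be sketched in tabular form, which also shows why per-(s, a) updates don't scale. The helper name `q_update` and the learning rate `alpha` are assumptions for illustration, not from the slide.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One step of the iterative update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy usage: one transition in a two-action problem.
Q = defaultdict(float)  # one table entry per (state, action) pair
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1], alpha=0.5, gamma=0.9)
```

The table `Q` needs one entry per (state, action) pair, which is exactly what becomes infeasible in large state spaces.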
  11. Value-Based Methods: Q-Learning
     Use a function approximator to estimate the action-value function: Q(s, a; θ) ≈ Q*(s, a), where θ is the function parameter (the weights of the neural network). When the function approximator is a deep neural network, this is a DQN.
     Loss function: L_i(θ_i) = E[ (y_i − Q(s, a; θ_i))² ], with target y_i = r + γ max_{a'} Q(s', a'; θ_{i−1}).
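The loss can be sketched with a toy linear approximator standing in for the deep network. The linear `q_approx` and function names are assumptions for illustration; the key point is the squared TD error against a target computed with an older parameter copy.

```python
import numpy as np

def q_approx(theta, s):
    """Toy linear stand-in for the network: Q(s, .; theta) = theta @ s,
    one row of theta per action."""
    return theta @ s

def dqn_loss(theta, theta_old, batch, gamma=0.99):
    """Mean squared TD error L(theta) = E[(y - Q(s,a;theta))^2],
    with target y = r + gamma * max_a' Q(s',a'; theta_old)."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * float(np.max(q_approx(theta_old, s_next)))
        total += (y - q_approx(theta, s)[a]) ** 2
    return total / len(batch)
```

Holding `theta_old` fixed while `theta` is optimized is what makes the regression target stable between updates.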
  12. Value-Based Methods: DQN
     Learning from batches of consecutive samples is problematic and costly:
     - Sample correlation: samples are correlated, which in turn makes learning inefficient.
     - Bad feedback loops: the current Q-network parameters dictate the next training samples and can lead to bad feedback loops (e.g., if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side).
     To solve this, use experience replay: continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}), and train the Q-network on random mini-batches of transitions drawn from the replay memory.
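A minimal replay memory can be sketched as below; the deque-based implementation is an assumption for illustration (real DQN implementations store observations more compactly).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}) transitions.
    Sampling random mini-batches breaks the correlation between
    consecutive samples and the feedback loop described above."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The `maxlen` bound also bounds memory use, so very old transitions from an outdated policy are gradually discarded.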
  13. Concept Network Reinforcement Learning
     ● Solves complex tasks by decomposing them into high-level actions, or "concepts".
     ● A multi-level hierarchical RL approach, inspired by Sutton's options framework: abstraction over low-level actions enables efficient exploration, improving sample efficiency significantly, especially in sparse-reward settings.
     ● Allows existing solutions to sub-problems to be composed into an overall solution without requiring re-training.
  14. Temporal Abstractions
     ● At each time t, for each state s_t, a higher-level "selector" chooses a concept c_t among all concepts available to it.
     ● Each concept remains active for some time, until a predefined terminal state is reached.
     ● An internal critic evaluates how close the agent is to satisfying the terminal condition of c_t and sends a reward r_c(t) to the selector.
     ● Similar to baseline RL, except that an extra layer of abstraction is defined over the set of "primitive" actions, forming a concept, so that from the selector's perspective executing a concept corresponds to taking a single action.
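The selector/concept loop described above can be sketched as follows. All names here are hypothetical, and the internal critic's reward to the selector is omitted for brevity.

```python
def run_with_concepts(env_step, s, selector, concepts,
                      max_switches=10, max_steps=1000):
    """At each decision point the selector picks a concept c for state s;
    c's policy then emits primitive actions until c's terminal condition
    holds, after which control returns to the selector."""
    trace, steps = [], 0
    for _ in range(max_switches):
        c = selector(s)                               # high-level choice
        policy, is_terminal = concepts[c]
        while not is_terminal(s) and steps < max_steps:
            s = env_step(s, policy(s))                # primitive action
            steps += 1
        trace.append(c)                               # concept finished
    return s, trace

# Toy usage: integer state, two concepts with different step sizes.
concepts = {
    "reach_10": (lambda s: 1, lambda s: s >= 10),
    "reach_20": (lambda s: 2, lambda s: s >= 20),
}
selector = lambda s: "reach_10" if s < 10 else "reach_20"
final, trace = run_with_concepts(lambda s, a: s + a, 0, selector, concepts,
                                 max_switches=2)
```

Note that the selector only acts at concept boundaries, which is exactly the temporal abstraction: many primitive steps per one high-level decision.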
  15. LunarLander with Concepts
  16. LunarLander with Concepts
  17. Robotics Pick and Place with Concepts: Lift, Orient, Stack
  18. Robotics Pick and Place with Concepts
  19. Robotics Pick and Place with Concepts: Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks
  20. Thank you!
  21. Backup Slides for Q/A
  22. Definitions
     State: the agent's internal representation of the environment; the information the agent uses to pick the next action.
     Policy: dictates the agent's behavior, mapping states to actions. Deterministic policy: a = π(s). Stochastic policy: π(a|s) = P(A_t = a | S_t = s).
     Value function: determines how good each state (and action) is: V^π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ], and similarly Q^π(s, a).
     Model: predicts what the environment will do next (a simulator's job, for instance).
  23. RL's Main Loop
  24. Value-Based Methods: DQN with Experience Replay (2)
  25. Learning vs. Planning
     Learning (model-free reinforcement learning): the environment is initially unknown; the agent interacts with the environment and improves its policy based on those interactions.
     Planning (model-based reinforcement learning): a model of the environment is known or acquired; the agent performs computations with the model, without any external interaction, and improves its policy based on those computations.
  26. LunarLander with Concept Network
  27. Introduction to RL: Challenges (Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind)
  28. Policy-Based Methods
     ● The Q-function can be complex and unnecessary: all we want is the best action. For example, in a very high-dimensional state space, it is wasteful and costly to learn the exact value of every (state, action) pair.
     ● Instead, define parameterized policies π_θ(a|s), define the value of each policy J(θ) = E[ Σ_t γ^t r_t | π_θ ], and run gradient ascent on the policy parameters θ to find the optimal policy.
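A minimal REINFORCE-style sketch of this idea follows; the softmax-linear policy and function names are assumptions for illustration, not from the slide.

```python
import numpy as np

def softmax_policy(theta, s):
    """Parameterized policy pi_theta(a|s): one linear score per action."""
    logits = theta @ s
    p = np.exp(logits - logits.max())   # stabilized softmax
    return p / p.sum()

def reinforce_step(theta, episode, alpha=0.01, gamma=0.99):
    """One gradient-ascent step on J(theta):
    theta <- theta + alpha * G_t * grad log pi_theta(a_t|s_t)."""
    G = 0.0
    for s, a, r in reversed(episode):   # returns computed backwards
        G = r + gamma * G
        p = softmax_policy(theta, s)
        grad_log = -np.outer(p, s)      # d log pi / d theta, all actions
        grad_log[a] += s                # indicator term for the taken action
        theta = theta + alpha * G * grad_log
    return theta

# Toy usage: one rewarded step should make the taken action more likely.
theta0 = np.zeros((2, 2))
theta1 = reinforce_step(theta0, [(np.array([1.0, 0.0]), 0, 1.0)])
```

Note the contrast with Q-learning: no value table or Q-network is needed, only the policy parameters themselves.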