Introduction to Deep Reinforcement Learning
2019 · https://deeplearning.mit.edu
For the full list of references visit: https://hcai.mit.edu/references
Deep Reinforcement Learning (Deep RL)
• What is it? A framework for learning to solve sequential decision-making problems.
• How? Trial and error in a world that provides occasional rewards.
• Deep? Deep RL = RL + Neural Networks
[306, 307]
[Figure: Deep Learning vs. Deep RL]
Types of Learning
• Supervised Learning
• Semi-Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
It’s all “supervised” by a loss function!
*Someone has to say what’s good and what’s bad (see Socrates, Epictetus, Kant, Nietzsche, etc.)
[Figure: Input → Neural Network → Output, with supervision* judging Good or Bad]
Machine Learning:
Supervised vs Reinforcement
Supervised learning is “teach by example”:
Here are some examples, now learn patterns in these examples.
Reinforcement learning is “teach by experience”:
Here’s a world, now learn patterns by exploring it.
[330]
[Figure: Failure vs. Success]
Reinforcement Learning in Humans
• Humans appear to learn to walk through “very few examples” of
trial and error. How they do so is an open question…
• Possible answers:
• Hardware: 230 million years of bipedal movement data.
• Imitation Learning: Observation of other humans walking.
• Algorithms: Better than backpropagation and stochastic gradient descent.
[286, 291]
Open Question: What can be learned from data?
[Figure: the AI stack: Environment → Sensors → Sensor Data → Feature Extraction →
Representation → Machine Learning → Knowledge → Reasoning → Planning → Action →
Effector → back to the Environment]
[AI stack: Sensors]
References: [132]
Sensors: Lidar, Camera (Visible, Infrared), Radar, Stereo Camera, Microphone,
GPS, IMU, Networking (Wired, Wireless)
Image Recognition:
If it looks like a duck
Activity Recognition:
Swims like a duck
Audio Recognition:
Quacks like a duck
Final breakthrough, 358 years after its conjecture:
“It was so indescribably beautiful; it was so simple and
so elegant. I couldn’t understand how I’d missed it and
I just stared at it in disbelief for twenty minutes. Then
during the day I walked around the department, and
I’d keep coming back to my desk looking to see if it
was still there. It was still there. I couldn’t contain
myself, I was so excited. It was the most important
moment of my working life. Nothing I ever do again
will mean as much.” (Andrew Wiles, on proving Fermat’s Last Theorem)
References: [133]
[Figure: the promise of Deep Learning covers the representation and learning
layers of the stack; the promise of Deep Reinforcement Learning extends it
through planning and action]
Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
[80]
Open Questions:
• What cannot be modeled in
this way?
• What are the challenges of
learning in this framework?
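To make the loop concrete, here is a minimal sketch of the agent-environment interaction, assuming the classic (pre-0.26) OpenAI Gym API; the CartPole environment and the random placeholder policy are just illustrations:

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()
done, episode_return = False, 0.0

while not done:
    action = env.action_space.sample()            # placeholder policy: act randomly
    state, reward, done, info = env.step(action)  # execute action, observe new state, receive reward
    episode_return += reward

print("episode return:", episode_return)
```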
Environment and Actions
• Fully Observable (Chess) vs Partially Observable (Poker)
• Single Agent (Atari) vs Multi-Agent (DeepTraffic)
• Deterministic (Cart Pole) vs Stochastic (DeepTraffic)
• Static (Chess) vs Dynamic (DeepTraffic)
• Discrete (Chess) vs Continuous (Cart Pole)
Note: Real-world environments might not technically be stochastic
or partially observable, but they might as well be treated as such
due to their complexity.
The Challenge for RL in Real-World Applications
Reminder:
Supervised learning:
teach by example
Reinforcement learning:
teach by experience
Open Challenges. Two Options:
1. Real-world observation + one-shot trial & error
2. Realistic simulation + transfer learning
To make option 2 work: (1) improve transfer learning, (2) improve simulation.
Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: agent’s behavior function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment
A trajectory: s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n−1}, a_{n−1}, r_n, s_n
(s: state, a: action, r: reward; s_n: terminal state)
Meaning of Life for RL Agent: Maximize Reward
[84]
• Future reward: R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{n−t} r_n
• A good strategy for an agent would be to always choose
an action that maximizes the (discounted) future reward
• Why “discounted”?
• Math trick to help analyze convergence
• Uncertainty due to environment stochasticity, partial
observability, or that life can end at any moment:
“If today were the last day of my life, would I want
to do what I’m about to do today?” – Steve Jobs
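As a small worked sketch, the discounted return above can be computed by folding the reward sequence from right to left (the reward list and γ value are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, computed backwards from the final reward."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([0, 0, 1], gamma=0.9))  # 0 + 0.9*0 + 0.9^2*1 = 0.81
```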
Robot in a Room
• Actions: UP, DOWN, LEFT, RIGHT
• (Stochastic) model of the world for action UP:
80% move UP, 10% move LEFT, 10% move RIGHT
• Reward +1 at [4,3], -1 at [4,2]
• Reward -0.04 for each step
• What’s the strategy to achieve max reward?
• We can learn the model and plan (see the value-iteration sketch below)
• We can learn the value of (action, state) pairs and act greedily/non-greedily
• We can learn the policy directly while sampling from it
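For the first option, a minimal value-iteration sketch over this gridworld follows; the wall at [2,2] is an assumption taken from the classic Russell & Norvig layout, since the grid figure is not reproduced here:

```python
# Value iteration for the 4x3 "Robot in a Room" gridworld.
# Coordinates are 0-indexed (col, row); slide's [4,3] -> (3,2), [4,2] -> (3,1).
COLS, ROWS = 4, 3
WALL = (1, 1)                                # assumed wall cell ([2,2] in slide coordinates)
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}     # goal and pit
STEP_REWARD, GAMMA = -0.04, 1.0
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERP = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
        "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def move(s, a):
    """Deterministic effect of action a in state s; bumping into wall/edge stays put."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return s if nxt == WALL or not (0 <= nxt[0] < COLS and 0 <= nxt[1] < ROWS) else nxt

def transitions(s, a):
    """Stochastic model: 80% intended direction, 10% each perpendicular."""
    left, right = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

states = [(c, r) for c in range(COLS) for r in range(ROWS) if (c, r) != WALL]
V = {s: TERMINALS.get(s, 0.0) for s in states}
for _ in range(100):                         # Bellman backups until (approximate) convergence
    for s in states:
        if s in TERMINALS:
            continue
        V[s] = max(sum(p * (STEP_REWARD + GAMMA * V[s2]) for p, s2 in transitions(s, a))
                   for a in ACTIONS)

# Greedy policy with respect to the converged values.
policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
          for s in states if s not in TERMINALS}
```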
Optimal Policy for a Deterministic World
• Reward: -0.04 for each step
• Actions: UP, DOWN, LEFT, RIGHT
• When actions are deterministic: UP means 100% move UP
• Policy: shortest path.
Optimal Policy for a Stochastic World
• Reward: -0.04 for each step
• Actions: UP, DOWN, LEFT, RIGHT
• When actions are stochastic: UP means 80% move UP, 10% move LEFT, 10% move RIGHT
• Policy: shortest path, but avoid the UP action next to the -1 square.
Optimal Policy for a Stochastic World
• Reward: -2 for each step
• Actions: UP, DOWN, LEFT, RIGHT
• When actions are stochastic: UP means 80% move UP, 10% move LEFT, 10% move RIGHT
• Policy: shortest path.
Optimal Policy for a Stochastic World
• Reward: -0.1 for each step → more urgent policy
• Reward: -0.04 for each step → less urgent policy
Optimal Policy for a Stochastic World
• Reward: +0.01 for each step
• Actions: UP, DOWN, LEFT, RIGHT
• When actions are stochastic: UP means 80% move UP, 10% move LEFT, 10% move RIGHT
• Policy: longest path (a positive step reward makes the agent avoid ever terminating).
Lessons from Robot in a Room
• The environment model has a big impact on the optimal policy
• The reward structure has a big impact on the optimal policy
Reward Structure May Have Unintended Consequences
Player gets reward based on:
1. Finishing time
2. Finishing position
3. Picking up “turbos”
[285]
[Video: Human vs. AI (Deep RL Agent)]
AI Safety
Risk (and thus human life) becomes part of the loss function
Examples of Reinforcement Learning
Cart-Pole Balancing
• Goal: Balance the pole on top of a moving cart
• State: Pole angle, angular speed, cart position, horizontal velocity
• Actions: Horizontal force applied to the cart
• Reward: +1 at each time step if the pole is upright
[166]
Examples of Reinforcement Learning
Doom*
• Goal: Eliminate all opponents
• State: Raw pixels of the game
• Actions: Up, Down, Left, Right, Shoot, etc.
• Reward: Positive when eliminating an opponent,
negative when the agent is eliminated
[166]
* Added for important thought-provoking considerations of AI safety in the context of
autonomous weapons systems (see AGI lectures on the topic).
Examples of Reinforcement Learning
Grasping Objects with Robotic Arm
• Goal: Pick up objects of different shapes
• State: Raw pixels from a camera
• Actions: Move the arm; grasp
• Reward: Positive when the pickup is successful
[332]
Examples of Reinforcement Learning
Human Life
• Goal: Survival? Happiness?
• State: Sight. Hearing. Taste. Smell. Touch.
• Actions: Think. Move.
• Reward: Homeostasis?
Key Takeaways for Real-World Impact
• Deep Learning:
• Fun part: Good algorithms that learn from data.
• Hard part: Good questions, huge amounts of representative data.
• Deep Reinforcement Learning:
• Fun part: Good algorithms that learn from data.
• Hard part: Defining a useful state space, action space, and reward.
• Hardest part: Getting meaningful data for the above formalization.
3 Types of Reinforcement Learning
Model-based
• Learn the model of
the world, then plan
using the model
• Update model often
• Re-plan often
[331, 333]
Value-based
• Learn the state or
state-action value
• Act by choosing best
action in state
• Exploration is a
necessary add-on
Policy-based
• Learn the stochastic
policy function that
maps state to action
• Act by sampling policy
• Exploration is baked in
Taxonomy of RL Methods
[334]
Link: https://spinningup.openai.com
Q-Learning
• State-action value function: Qπ(s,a)
• Expected return when starting in s, performing a, and following π thereafter
• Q-Learning: Use any policy to estimate Q that maximizes future reward:
• Q directly approximates Q* (Bellman optimality equation)
• Independent of the policy being followed
• Only requirement: keep updating each (s,a) pair
Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′,a′) − Q(s,a) ]
where s is the old state, s′ the new state, r the reward,
α the learning rate, and γ the discount factor.
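A minimal sketch of this update over a tabular Q (here a dict of dicts; the names are illustrative):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())       # max_a' Q(s', a')
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
```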
Exploration vs Exploitation
• Deterministic/greedy policy won’t explore all actions
• Don’t know anything about the environment at the beginning
• Need to try all actions to find the optimal one
• ε-greedy policy (sketched below)
• With probability 1−ε perform the optimal/greedy action, otherwise a random action
• Slowly move it towards the greedy policy: ε → 0
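A minimal sketch of ε-greedy action selection over a tabular Q (names are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps take a random action; otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

# Anneal toward the greedy policy over training, e.g.:
# eps = max(0.05, eps * 0.999)
```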
Q-Learning: Value Iteration
References: [84]
      A1   A2   A3   A4
S1    +1   +2   -1    0
S2    +2    0   +1   -2
S3    -1   +1    0   -2
S4    -2    0   +1   +1
Q-Learning: Representation Matters
• In practice, Value Iteration is impractical
• Very limited states/actions
• Cannot generalize to unobserved states
• Think about the Breakout game
• State: screen pixels
• Image size: 𝟖𝟒 × 𝟖𝟒 (resized)
• Consecutive 4 images
• Grayscale with 256 gray levels
256^(84×84×4) rows in the Q-table!
That is ≈ 10^67,970 ≫ 10^82 atoms in the observable universe.
References: [83, 84]
Deep RL = RL + Neural Networks
[20]
DQN: Deep Q-Learning
[83]
Use a neural network to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a)
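The network maps a stack of four 84×84 grayscale frames to one Q-value per action; here is a minimal, unofficial PyTorch rendering of that architecture (the Nature-2015 variant from Mnih et al.):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convnet from 4 stacked 84x84 grayscale frames to Q-values, one per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):  # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)
```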
Deep Q-Network (DQN): Atari
[83]
Mnih et al., “Playing Atari with Deep Reinforcement Learning,” 2013.
DQN and Double DQN
• Loss function (squared error): L = (target − prediction)², where
target = r + γ max_{a′} Q(s′,a′) and prediction = Q(s,a)
[83]
• DQN: the same network computes both Q terms
• Double DQN: a separate network for each Q term
• Helps reduce the bias introduced by the inaccuracies of the
Q network at the beginning of training
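A sketch of the target computation for both variants (PyTorch; the done mask zeroes the bootstrap term at episode ends, and the tensor names are illustrative):

```python
import torch

def td_target(reward, next_obs, done, q_net, target_net, gamma=0.99, double=True):
    """r + gamma * Q(s', a') with a' chosen greedily.
    Double DQN: select a' with the online network, evaluate it with the second network."""
    with torch.no_grad():
        if double:
            a_next = q_net(next_obs).argmax(dim=1, keepdim=True)
            q_next = target_net(next_obs).gather(1, a_next).squeeze(1)
        else:
            q_next = q_net(next_obs).max(dim=1).values
    return reward + gamma * (1.0 - done) * q_next
```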
DQN Tricks
• Experience Replay
• Stores experiences (actions, state transitions, and rewards) and creates
mini-batches from them for the training process
• Fixed Target Network
• The target in the error calculation depends on the network parameters
and thus changes quickly. Updating the target network only every 1,000
steps increases the stability of the training process.
[83, 167]
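A minimal sketch of the uniform replay buffer described above (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sampled mini-batch, breaking temporal correlations."""
        return random.sample(self.buffer, batch_size)
```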
Atari Breakout
[85]
[Video: the agent after 10, 120, and 240 minutes of training]
DQN Results in Atari
[83]
Dueling DQN (DDQN)
• Decompose Q(s,a)
• V(s): the value of being at that state
• A(s,a): the advantage of taking action a in state s versus all other possible
actions at that state
• Use two streams:
• one that estimates the state value V(s)
• one that estimates the advantage for each action A(s,a)
• Useful for states where action choice does not affect Q(s,a)
[336]
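The streams are typically recombined as Q(s,a) = V(s) + (A(s,a) − mean over a′ of A(s,a′)), where subtracting the mean keeps the two streams identifiable. A one-function sketch (PyTorch; value has shape (batch, 1) and advantage (batch, n_actions)):

```python
def dueling_aggregate(value, advantage):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))."""
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```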
Prioritized Experience Replay
• Sample experiences in proportion to their impact (e.g., the magnitude of
the TD error) rather than by frequency of occurrence
[336]
Policy Gradient (PG)
• DQN (off-policy): Approximate Q and infer the optimal policy
• PG (on-policy): Directly optimize the policy space
[63]
Good illustrative explanation: “Deep Reinforcement Learning:
Pong from Pixels”, http://karpathy.github.io/2016/05/31/rl/
[Figure: Policy Network]
Policy Gradient – Training
[63, 204]
• REINFORCE: Policy gradient that increases the probability of good actions
and decreases the probability of bad actions:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t|s_t) · R_t
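A minimal sketch of the corresponding surrogate loss (PyTorch; the return normalization is a common variance-reduction trick, not part of the slide):

```python
def reinforce_loss(log_probs, returns):
    """Surrogate whose gradient is the REINFORCE estimator:
    increase log pi(a_t|s_t) in proportion to the return R_t."""
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    return -(log_probs * returns).sum()
```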
Policy Gradient
• Pros vs DQN:
• Messy World: If the Q function is too complex to be learned, DQN may fail
miserably, while PG will still learn a good policy.
• Speed: Faster convergence
• Stochastic Policies: Capable of learning stochastic policies, which DQN can’t
• Continuous actions: Much easier to model continuous action spaces
• Cons vs DQN:
• Data: Sample inefficient (needs more data)
• Stability: Less stable during training
• Poor credit assignment to (state, action) pairs for delayed rewards
[63, 335]
Problem with REINFORCE: calculating the reward only at the end of the
episode means all the actions along the way get credited as good
whenever the total reward was high.
Advantage Actor-Critic (A2C)
• Combine DQN (value-based) and REINFORCE (policy-based)
• Two neural networks (Actor and Critic):
• Actor is policy-based: Samples the action from a policy
• Critic is value-based: Measures how good the chosen action is
[335]
• Update at each time step - temporal difference (TD) learning
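As a sketch of how the two networks interact in a one-step TD update (PyTorch; the names and the use of a state-value critic are illustrative assumptions):

```python
def a2c_losses(log_prob, value, reward, next_value, gamma=0.99):
    """One-step TD actor-critic.
    Critic regresses V(s) toward the TD target; actor weights
    log pi(a|s) by the TD error (an advantage estimate)."""
    td_target = reward + gamma * next_value.detach()
    advantage = td_target - value
    critic_loss = advantage.pow(2).mean()
    actor_loss = -(log_prob * advantage.detach()).mean()
    return actor_loss, critic_loss
```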
Asynchronous Advantage Actor-Critic (A3C)
[337]
• Both A3C and A2C use parallelism in training
• A2C syncs up the workers for a global parameter update and then starts
each iteration with the same policy, while A3C lets workers update asynchronously
Deep Deterministic Policy Gradient (DDPG)
• Actor-Critic framework for learning a deterministic policy
• Can be thought of as: DQN for continuous action spaces
• As with DQN, the following tricks are required:
• Experience Replay
• Target network
• Exploration: add noise to the actions,
reducing the scale of the noise
as training progresses
[341]
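Two of these ingredients in a minimal PyTorch sketch (the soft-update rate τ and the noise schedule are illustrative assumptions):

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak-average the target network toward the online network."""
    for t, s in zip(target_net.parameters(), online_net.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * s.data)

def noisy_action(actor, state, noise_scale):
    """Deterministic action plus exploration noise; anneal noise_scale over training."""
    with torch.no_grad():
        action = actor(state)
    return action + noise_scale * torch.randn_like(action)
```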
Policy Optimization
• Progress beyond Vanilla Policy Gradient:
• Natural Policy Gradient
• TRPO
• PPO
• Basic idea in on-policy optimization:
Avoid taking bad actions that collapse the training performance.
[339]
Line Search:
First pick direction, then step size
Trust Region:
First pick step size, then direction
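PPO, the last item in the list above, implements this idea with a clipped surrogate objective; a minimal sketch (PyTorch; tensor names are illustrative):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """PPO clipped surrogate: limit the policy ratio to [1-eps, 1+eps]
    so a single update cannot move the policy far enough to collapse performance."""
    ratio = (new_log_prob - old_log_prob).exp()
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```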
Game of Go
[170]
AlphaGo (2016) Beat Top Human at Go
[83]
AlphaGo Zero (2017): Beats AlphaGo
[149]
AlphaGo Zero Approach
[170]
• Same as the best before: Monte Carlo Tree Search (MCTS)
• Balance exploitation/exploration (going deep on promising positions or
exploring new underplayed positions)
• Use a neural network as “intuition” for which positions to
expand as part of MCTS (same as AlphaGo)
• “Tricks”
• Use MCTS intelligent look-ahead (instead of human games) to improve
value estimates of play options
• Multi-task learning: “two-headed” network that outputs (1) move
probability and (2) probability of winning.
• Updated architecture: use residual networks
AlphaZero vs Chess, Shogi, Go
[340]
To date, for most successful robots operating in the real world:
Deep RL is not involved
[169]
But… that’s slowly changing:
Learning Control Dynamics
[343]
But… that’s slowly changing:
Learning to Drive: Beyond Pure Imitation (Waymo)
[343]
Thinking Outside the Box:
Multiverse Theory and the Simulation Hypothesis
• Create an (infinite) set of simulation environments to learn in
so that our reality becomes just another sample from the set.
[342]
Next Steps in Deep RL
• Lectures: https://deeplearning.mit.edu
• Tutorials: https://github.com/lexfridman/mit-deep-learning
• Advice for Research (from Spinning Up as a Deep RL Researcher by Joshua Achiam)
• Background
• Fundamentals in probability, statistics, multivariate calculus.
• Deep learning basics
• Deep RL basics
• TensorFlow (or PyTorch)
• Learn by doing
• Implement core deep RL algorithms (discussed today)
• Look for the tricks and details in papers that were key to getting them to work
• Iterate fast in simple environments (see Einstein quote on simplicity)
• Research
• Improve on an existing approach
• Focus on an unsolved task / benchmark
• Create a new task / problem that hasn’t been addressed with RL
[339]

More Related Content

What's hot

Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
Kuppusamy P
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
Knoldus Inc.
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
MeetupDataScienceRoma
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
Omar Enayet
 
Review MLP Mixer
Review MLP MixerReview MLP Mixer
Review MLP Mixer
Woojin Jeong
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
Salem-Kabbani
 
Dueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement LearningDueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement Learning
Yoonho Lee
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
Dongmin Lee
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
Jie-Han Chen
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
Jie-Han Chen
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
Nikolay Pavlov
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
Taehoon Kim
 
Metaheuristics
MetaheuristicsMetaheuristics
Metaheuristics
ossein jain
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
Databricks
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Prof. Neeta Awasthy
 
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive EnvironmentsMulti PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Jisang Yoon
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-Learning
Kai-Wen Zhao
 
Intro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIIntro to Reinforcement learning - part III
Intro to Reinforcement learning - part III
Mikko Mäkipää
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Sanghyuk Chun
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
mooopan
 

What's hot (20)

Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Review MLP Mixer
Review MLP MixerReview MLP Mixer
Review MLP Mixer
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Dueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement LearningDueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement Learning
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
 
Metaheuristics
MetaheuristicsMetaheuristics
Metaheuristics
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive EnvironmentsMulti PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-Learning
 
Intro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIIntro to Reinforcement learning - part III
Intro to Reinforcement learning - part III
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 

Similar to MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman

MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex FridmanMIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
Peerasak C.
 
Finding Leverage with System Dynamics
Finding Leverage with System DynamicsFinding Leverage with System Dynamics
Finding Leverage with System Dynamics
Foundation for Healthy Generations
 
Blind Spots in Persuasive Design
Blind Spots in Persuasive DesignBlind Spots in Persuasive Design
Blind Spots in Persuasive Design
Sebastian Deterding
 
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
aryan532920
 
An Introduction to Usability
An Introduction to UsabilityAn Introduction to Usability
An Introduction to Usability
dirk.swart
 
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Lars Harmsen
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
OPEN KNOWLEDGE GmbH
 
Measuring Direct Social Media ROI
Measuring Direct Social Media ROIMeasuring Direct Social Media ROI
Measuring Direct Social Media ROIKemp Edmonds
 
Baworld adapting to whats happening
Baworld adapting to whats happeningBaworld adapting to whats happening
Baworld adapting to whats happening
Dave Davis PMP, PgMP, PBA
 
User Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small TeamsUser Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small Teams
Damon Sanchez
 
Enterprise social what is the real value to the business - collab con - mar...
Enterprise social   what is the real value to the business - collab con - mar...Enterprise social   what is the real value to the business - collab con - mar...
Enterprise social what is the real value to the business - collab con - mar...Ruven Gotz
 
Be Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your DataBe Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your Data
Beth Kanter
 
Online Community Practices
Online Community PracticesOnline Community Practices
Online Community Practices
Nancy Wright White
 
Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...
Greatness Coaching
 
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact MeasurementYTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
Whole Whale
 
How Nonprofits Open Professional Doors
How Nonprofits Open Professional DoorsHow Nonprofits Open Professional Doors
How Nonprofits Open Professional Doors
Omar Garriott
 
Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015
Ruven Gotz
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
BHANU281672
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
lorainedeserre
 
Writing Paper Set Elegant Floral Print Stationery. Etsy
Writing Paper Set Elegant Floral Print Stationery.  EtsyWriting Paper Set Elegant Floral Print Stationery.  Etsy
Writing Paper Set Elegant Floral Print Stationery. Etsy
Michelle Meienburg
 

Similar to MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman (20)

MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex FridmanMIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
 
Finding Leverage with System Dynamics
Finding Leverage with System DynamicsFinding Leverage with System Dynamics
Finding Leverage with System Dynamics
 
Blind Spots in Persuasive Design
Blind Spots in Persuasive DesignBlind Spots in Persuasive Design
Blind Spots in Persuasive Design
 
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 
An Introduction to Usability
An Introduction to UsabilityAn Introduction to Usability
An Introduction to Usability
 
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Measuring Direct Social Media ROI
Measuring Direct Social Media ROIMeasuring Direct Social Media ROI
Measuring Direct Social Media ROI
 
Baworld adapting to whats happening
Baworld adapting to whats happeningBaworld adapting to whats happening
Baworld adapting to whats happening
 
User Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small TeamsUser Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small Teams
 
Enterprise social what is the real value to the business - collab con - mar...
Enterprise social   what is the real value to the business - collab con - mar...Enterprise social   what is the real value to the business - collab con - mar...
Enterprise social what is the real value to the business - collab con - mar...
 
Be Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your DataBe Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your Data
 
Online Community Practices
Online Community PracticesOnline Community Practices
Online Community Practices
 
Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...
 
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact MeasurementYTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
 
How Nonprofits Open Professional Doors
How Nonprofits Open Professional DoorsHow Nonprofits Open Professional Doors
How Nonprofits Open Professional Doors
 
Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
 
Writing Paper Set Elegant Floral Print Stationery. Etsy
Writing Paper Set Elegant Floral Print Stationery.  EtsyWriting Paper Set Elegant Floral Print Stationery.  Etsy
Writing Paper Set Elegant Floral Print Stationery. Etsy
 

More from Peerasak C.

เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVexเสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
Peerasak C.
 
Blockchain for Government Services
Blockchain for Government ServicesBlockchain for Government Services
Blockchain for Government Services
Peerasak C.
 
e-Conomy SEA 2020
e-Conomy SEA 2020e-Conomy SEA 2020
e-Conomy SEA 2020
Peerasak C.
 
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
Peerasak C.
 
๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล
Peerasak C.
 
TAT The Journey
TAT The JourneyTAT The Journey
TAT The Journey
Peerasak C.
 
FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019
Peerasak C.
 
How to start a business: Checklist and Canvas
How to start a business: Checklist and CanvasHow to start a business: Checklist and Canvas
How to start a business: Checklist and Canvas
Peerasak C.
 
The Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture PerformanceThe Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture Performance
Peerasak C.
 
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Peerasak C.
 
Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder
Peerasak C.
 
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Peerasak C.
 
e-Conomy SEA Report 2019
e-Conomy SEA Report 2019e-Conomy SEA Report 2019
e-Conomy SEA Report 2019
Peerasak C.
 
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
Peerasak C.
 
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDCเจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
Peerasak C.
 
RD Strategies & D2rive
RD Strategies & D2rive RD Strategies & D2rive
RD Strategies & D2rive
Peerasak C.
 
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
Peerasak C.
 
K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65
Peerasak C.
 
หนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Servicesหนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Services
Peerasak C.
 
General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019
Peerasak C.
 

More from Peerasak C. (20)

เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVexเสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
 
Blockchain for Government Services
Blockchain for Government ServicesBlockchain for Government Services
Blockchain for Government Services
 
e-Conomy SEA 2020
e-Conomy SEA 2020e-Conomy SEA 2020
e-Conomy SEA 2020
 
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
 
๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล
 
TAT The Journey
TAT The JourneyTAT The Journey
TAT The Journey
 
FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019
 
How to start a business: Checklist and Canvas
How to start a business: Checklist and CanvasHow to start a business: Checklist and Canvas
How to start a business: Checklist and Canvas
 
The Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture PerformanceThe Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture Performance
 
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
 
Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder
 
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
 
e-Conomy SEA Report 2019
e-Conomy SEA Report 2019e-Conomy SEA Report 2019
e-Conomy SEA Report 2019
 
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
 
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDCเจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
 
RD Strategies & D2rive
RD Strategies & D2rive RD Strategies & D2rive
RD Strategies & D2rive
 
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
 
K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65
 
หนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Servicesหนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Services
 
General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019
 

Recently uploaded

The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBCSTRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
kimdan468
 
Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
gb193092
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
goswamiyash170123
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 

Recently uploaded (20)

The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBCSTRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
 
Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 

MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman

  • 2. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep Reinforcement Learning (Deep RL) • What is it? Framework for learning to solve sequential decision making problems. • How? Trial and error in a world that provides occasional rewards • Deep? Deep RL = RL + Neural Networks [306, 307] Deep Learning Deep RL
  • 3. 2019https://deeplearning.mit.edu Types of Learning • Supervised Learning • Semi-Supervised Learning • Unsupervised Learning • Reinforcement Learning It’s all “supervised” by a loss function! *Someone has to say what’s good and what’s bad (see Socrates, Epictetus, Kant, Nietzsche, etc.) Neural Network Input Output Good or Bad? Supervision*
  • 4. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Supervised learning is “teach by example”: Here’s some examples, now learn patterns in these example. Reinforcement learning is “teach by experience”: Here’s a world, now learn patterns by exploring it.
  • 5. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Machine Learning: Supervised vs Reinforcement Supervised learning is “teach by example”: Here’s some examples, now learn patterns in these example. Reinforcement learning is “teach by experience”: Here’s a world, now learn patterns by exploring it. [330] Failure Success
  • 6. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reinforcement Learning in Humans • Humans appear to learn to walk through “very few examples” of trial and error. How they do so is an open question… • Possible answers: • Hardware: 230 million years of bipedal movement data. • Imitation Learning: Observation of other humans walking. • Algorithms: Better than backpropagation and stochastic gradient descent [286, 291]
  • 7. 2019https://deeplearning.mit.edu Open Question: What can be learned from data? Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 8. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation References: [132] Lidar Camera (Visible, Infrared) Radar Stereo Camera Microphone GPS IMU Networking (Wired, Wireless)
  • 9. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 10. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 11. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation Image Recognition: If it looks like a duck Activity Recognition: Swims like a duck Audio Recognition: Quacks like a duck
  • 12. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation Final breakthrough, 358 years after its conjecture: “It was so indescribably beautiful; it was so simple and so elegant. I couldn’t understand how I’d missed it and I just stared at it in disbelief for twenty minutes. Then during the day I walked around the department, and I’d keep coming back to my desk looking to see if it was still there. It was still there. I couldn’t contain myself, I was so excited. It was the most important moment of my working life. Nothing I ever do again will mean as much.”
  • 13. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation References: [133]
  • 14. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation The promise of Deep Learning The promise of Deep Reinforcement Learning
  • 15. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reinforcement Learning Framework At each step, the agent: • Executes action • Observes new state • Receives reward [80] Open Questions: • What cannot be modeled in this way? • What are the challenges of learning in this framework?
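To make that interaction loop concrete, here is a minimal, self-contained sketch in Python; the toy environment and random action choice are illustrative stand-ins, not anything from the lecture:

```python
import random

# Toy environment: a minimal stand-in used only to illustrate the loop.
class ToyEnv:
    def reset(self):
        self.t = 0
        return 0                                  # initial state
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0      # occasional reward
        done = self.t >= 10                       # episode ends after 10 steps
        return self.t, reward, done               # new state, reward, termination flag

env = ToyEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])                # agent executes an action
    state, reward, done = env.step(action)        # observes new state, receives reward
```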
  • 16. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Environment and Actions • Fully Observable (Chess) vs Partially Observable (Poker) • Single Agent (Atari) vs Multi Agent (DeepTraffic) • Deterministic (Cart Pole) vs Stochastic (DeepTraffic) • Static (Chess) vs Dynamic (DeepTraffic) • Discrete (Chess) vs Continuous (Cart Pole) Note: Real-world environments might not technically be stochastic or partially observable, but they might as well be treated as such due to their complexity.
  • 17. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references The Challenge for RL in Real-World Applications Reminder: Supervised learning: teach by example Reinforcement learning: teach by experience Two options: 1. Real-world observation + one-shot trial & error 2. Realistic simulation + transfer learning Corresponding open challenges: 1. Improve transfer learning 2. Improve simulation
  • 18. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Major Components of an RL Agent An RL agent may be directly or indirectly trying to learn a: • Policy: agent’s behavior function • Value function: how good is each state and/or action • Model: agent’s representation of the environment A trajectory is a sequence of states, actions, and rewards ending in a terminal state: $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$
  • 19. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Meaning of Life for RL Agent: Maximize Reward [84] • Future reward: $R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$ • Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$ • A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward • Why “discounted”? • Math trick to help analyze convergence • Uncertainty due to environment stochasticity, partial observability, or that life can end at any moment: “If today were the last day of my life, would I want to do what I’m about to do today?” – Steve Jobs
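As a worked illustration, a discounted return can be computed from a reward sequence with one backward pass; this helper is a sketch, not code from the slides:

```python
def discounted_return(rewards, gamma=0.99):
    # Accumulate from the end: R_t = r_t + gamma * R_{t+1}
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 1, 1] with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```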
  • 20. 2019https://deeplearning.mit.edu actions: UP, DOWN, LEFT, RIGHT (Stochastic) model of the world: Action: UP 80% move UP 10% move LEFT 10% move RIGHT Robot in a Room +1 -1 START • Reward +1 at [4,3], -1 at [4,2] • Reward -0.04 for each step • What’s the strategy to achieve max reward? • We can learn the model and plan • We can learn the value of (action, state) pairs and act greedily/non-greedily • We can learn the policy directly while sampling from it
  • 21. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Deterministic World Reward: -0.04 for each step actions: UP, DOWN, LEFT, RIGHT When actions are deterministic: UP 100% move UP 0% move LEFT 0% move RIGHT Policy: Shortest path. +1 -1
  • 22. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -0.04 for each step +1 -1 actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Shortest path. Avoid UP moves around the -1 square.
  • 23. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -2 for each step +1 -1 actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Shortest path.
  • 24. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -0.1 for each step (more urgent) +1 -1 vs. Reward: -0.04 for each step (less urgent) +1 -1
  • 25. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: +0.01 for each step actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Longest path. +1 -1
  • 26. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Lessons from Robot in Room • Environment model has big impact on optimal policy • Reward structure has big impact on optimal policy
  • 27. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reward structure may have Unintended Consequences Player gets reward based on: 1. Finishing time 2. Finishing position 3. Picking up “turbos” [285] Human AI (Deep RL Agent)
  • 28. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AI Safety: Risk (and thus human life) is part of the loss function
  • 29. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Cart-Pole Balancing • Goal: Balance the pole on top of a moving cart • State: Pole angle, angular speed, cart position, horizontal velocity • Actions: Horizontal force applied to the cart • Reward: +1 at each time step if the pole is upright [166]
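For readers who want to try cart-pole themselves, the task ships with OpenAI Gym; this sketch runs a random policy and assumes the classic Gym step API (newer Gymnasium releases return five values from step, so treat the signature as an assumption about the installed version):

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()            # pole angle/angular speed, cart position/velocity
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # random horizontal force (0 = left, 1 = right)
    state, reward, done, info = env.step(action)   # +1 reward per step the pole stays upright
    total_reward += reward
print("Episode return:", total_reward)
```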
  • 30. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Doom* • Goal: Eliminate all opponents • State: Raw game pixels • Actions: Up, Down, Left, Right, Shoot, etc. • Reward: • Positive when eliminating an opponent, negative when the agent is eliminated [166] * Added for important thought-provoking considerations of AI safety in the context of autonomous weapons systems (see AGI lectures on the topic).
  • 31. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Grasping Objects with Robotic Arm • Goal: Pick up objects of different shapes • State: Raw pixels from camera • Actions: Move arm. Grasp. • Reward: Positive when pickup is successful [332]
  • 32. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Examples of Reinforcement Learning Human Life • Goal - Survival? Happiness? • State - Sight. Hearing. Taste. Smell. Touch. • Actions - Think. Move. • Reward – Homeostasis?
  • 33. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Key Takeaways for Real-World Impact • Deep Learning: • Fun part: Good algorithms that learn from data. • Hard part: Good questions, huge amounts of representative data. • Deep Reinforcement Learning: • Fun part: Good algorithms that learn from data. • Hard part: Defining a useful state space, action space, and reward. • Hardest part: Getting meaningful data for the above formalization.
  • 34. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references 3 Types of Reinforcement Learning Model-based • Learn the model of the world, then plan using the model • Update model often • Re-plan often [331, 333] Value-based • Learn the state or state-action value • Act by choosing best action in state • Exploration is a necessary add-on Policy-based • Learn the stochastic policy function that maps state to action • Act by sampling policy • Exploration is baked in
  • 35. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334] Link: https://spinningup.openai.com
  • 36. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 37. 2019https://deeplearning.mit.edu Q-Learning • State-action value function: Q(s,a) • Expected return when starting in s, performing a, and following policy π • Q-Learning: Use any policy to estimate Q that maximizes future reward: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$ where $\alpha$ is the learning rate, $\gamma$ the discount factor, $r_{t+1}$ the reward, and $s_{t+1}$ the new state reached from the old state $s_t$ • Q directly approximates Q* (Bellman optimality equation) • Independent of the policy being followed • Only requirement: keep updating each (s,a) pair
  • 38. 2019https://deeplearning.mit.edu Exploration vs Exploitation • Deterministic/greedy policy won’t explore all actions • Don’t know anything about the environment at the beginning • Need to try all actions to find the optimal one • ε-greedy policy • With probability 1-ε perform the optimal/greedy action, otherwise a random action • Slowly move it toward the greedy policy: ε → 0
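Combining the update rule from the previous slide with ε-greedy exploration, tabular Q-learning fits in a few lines; the 5-state chain environment below is an illustrative stand-in, not from the lecture:

```python
import random
from collections import defaultdict

# Tabular Q-learning with epsilon-greedy exploration: a minimal sketch.
# Chain environment: action 0 moves left, action 1 moves right,
# and reaching the right end pays +1 and terminates the episode.
N_STATES, ACTIONS = 5, [0, 1]
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                 # Q[(state, action)], initialized to 0

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        target = r + gamma * max(Q[(s2, act)] for act in ACTIONS) * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
```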
  • 39. 2019https://deeplearning.mit.edu Q-Learning: Value Iteration References: [84] Example Q-table of state-action values:
        A1   A2   A3   A4
  S1    +1   +2   -1    0
  S2    +2    0   +1   -2
  S3    -1   +1    0   -2
  S4    -2    0   +1   +1
  • 40. 2019https://deeplearning.mit.edu Q-Learning: Representation Matters • In practice, Value Iteration is impractical • Very limited states/actions • Cannot generalize to unobserved states • Think about the Breakout game • State: screen pixels • Image size: $84 \times 84$ (resized) • Consecutive 4 images • Grayscale with 256 gray levels That gives $256^{84 \times 84 \times 4}$ rows in the Q-table! $\approx 10^{67{,}970} \gg 10^{82}$ atoms in the observable universe References: [83, 84]
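A quick sanity check of that exponent (the number of decimal digits in the row count):

```python
import math

# Digits in 256^(84*84*4): exponent = 84 * 84 * 4 * log10(256)
print(84 * 84 * 4 * math.log10(256))   # ~67970, i.e. the table has ~10^67,970 rows
```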
  • 41. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep RL = RL + Neural Networks [20]
  • 42. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN: Deep Q-Learning [83] Use a neural network to approximate the Q-function:
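As one possible concrete form of “a neural network that approximates the Q-function,” here is a PyTorch sketch following the convolutional architecture reported in the Nature version of DQN [83]; the code itself is a reconstruction, not the authors' implementation:

```python
import torch.nn as nn

# Q-network for Atari: input is a stack of 4 grayscale 84x84 frames,
# output is one Q-value per discrete action (layer sizes per Mnih et al. [83]).
class QNetwork(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),     # Q(s, a) for each action a
        )
    def forward(self, x):
        return self.net(x / 255.0)         # scale pixel values to [0, 1]
```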
  • 43. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Deep Q-Network (DQN): Atari [83] Mnih et al. "Playing atari with deep reinforcement learning." 2013.
  • 44. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN and Double DQN • Loss function (squared error): $L = \big( \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} - \underbrace{Q(s, a)}_{\text{prediction}} \big)^2$ [83] • DQN: same network for both Q • Double DQN: separate network for each Q • Helps reduce bias introduced by the inaccuracies of the Q network at the beginning of training
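In code, the two targets differ by one line; this sketch assumes `q_net` and `target_net` are two Q-networks (e.g. the QNetwork above), and that `states, actions, rewards, next_states, dones` are batch tensors sampled from a replay buffer:

```python
import torch

# One-step TD targets for a mini-batch (a sketch under the assumptions above).
with torch.no_grad():
    # DQN: the target network both selects and evaluates the next action
    dqn_target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)

    # Double DQN: the online network selects the action, the target network evaluates it
    best = q_net(next_states).argmax(dim=1, keepdim=True)
    ddqn_target = rewards + gamma * target_net(next_states).gather(1, best).squeeze(1) * (1 - dones)

prediction = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = torch.nn.functional.mse_loss(prediction, ddqn_target)
```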
  • 45. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN Tricks • Experience Replay • Stores experiences (actions, state transitions, and rewards) and creates mini-batches from them for the training process • Fixed Target Network • The error calculation includes a target that depends on the network parameters and thus changes quickly; updating the target network only every 1,000 steps increases the stability of the training process. [83, 167]
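A replay buffer is commonly just a bounded FIFO store with uniform sampling; a minimal sketch:

```python
import random
from collections import deque

# Minimal experience replay buffer (a sketch, not the paper's implementation).
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are evicted first
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling breaks correlation
        return map(list, zip(*batch))                    # columns: states, actions, ...
```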
  • 46. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Atari Breakout [85] After 10 Minutes of Training • After 120 Minutes of Training • After 240 Minutes of Training
  • 47. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN Results in Atari [83]
  • 48. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Dueling DQN (DDQN) • Decompose Q(s,a) • V(s): the value of being at that state • A(s,a): the advantage of taking action a in state s versus all other possible actions at that state • Use two streams: • one that estimates the state value V(s) • one that estimates the advantage for each action A(s,a) • Useful for states where action choice does not affect Q(s,a) [336]
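The two streams are typically recombined as $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$, subtracting the mean advantage so the decomposition is identifiable; a PyTorch sketch of such a head:

```python
import torch.nn as nn

# Dueling head: shared features split into V(s) and A(s, a) streams (a sketch).
class DuelingHead(nn.Module):
    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a)
    def forward(self, features):
        v, a = self.value(features), self.advantage(features)
        # Subtract the mean advantage so V and A are identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```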
  • 49. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Prioritized Experience Replay • Sample experiences based on impact not frequency of occurrence [336]
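“Impact” is usually measured by the magnitude of the TD error; this simplified sketch samples proportionally to it (the paper's sum-tree storage and importance-sampling corrections are omitted):

```python
import numpy as np

# Simplified prioritized sampling: probability proportional to |TD error|^alpha.
def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-5):
    priorities = (np.abs(td_errors) + eps) ** alpha   # eps keeps zero-error samples reachable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```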
  • 50. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 51. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Policy Gradient (PG) • DQN (off-policy): Approximate Q and infer optimal policy • PG (on-policy): Directly optimize policy space [63] Good illustrative explanation: http://karpathy.github.io/2016/05/31/rl/ “Deep Reinforcement Learning: Pong from Pixels” Policy Network
  • 52. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Policy Gradient – Training [63, 204] • REINFORCE: Policy gradient that increases the probability of good actions and decreases the probability of bad actions: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t \right]$
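In an autodiff framework this gradient is obtained by minimizing the surrogate loss $-\log \pi_\theta(a_t \mid s_t)\, R_t$; a sketch for one finished episode, assuming `log_probs` (saved while acting), `rewards`, `gamma`, and an `optimizer` over the policy network already exist:

```python
import torch

# REINFORCE update for one episode (a sketch under the assumptions above).
returns, R = [], 0.0
for r in reversed(rewards):               # discounted return R_t for every step
    R = r + gamma * R
    returns.insert(0, R)
returns = torch.tensor(returns)

loss = -(torch.stack(log_probs) * returns).sum()  # raise log-prob of high-return actions
optimizer.zero_grad()
loss.backward()
optimizer.step()
```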
  • 53. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Policy Gradient • Pros vs DQN: • Messy World: If the Q function is too complex to be learned, DQN may fail miserably, while PG will still learn a good policy. • Speed: Faster convergence • Stochastic Policies: Capable of learning stochastic policies - DQN can’t • Continuous actions: Much easier to model continuous action spaces • Cons vs DQN: • Data: Sample inefficient (needs more data) • Stability: Less stable during the training process. • Poor credit assignment to (state, action) pairs for delayed rewards [63, 335] Problem with REINFORCE: calculating the reward only at the end of the episode means all of its actions get credited as good whenever the total reward was high.
  • 54. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 55. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Advantage Actor-Critic (A2C) • Combine DQN (value-based) and REINFORCE (policy-based) • Two neural networks (Actor and Critic): • Actor is policy-based: Samples the action from a policy • Critic is value-based: Measures how good the chosen action is [335] • Update at each time step - temporal difference (TD) learning
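With TD learning, the critic's estimate serves as both target and baseline: the advantage is $A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. A sketch of one update step, assuming `actor`/`critic` networks and the current transition (`state`, `log_prob`, `reward`, `next_state`, `done`, `gamma`) are in scope:

```python
# One A2C update at a single time step (a sketch under the assumptions above).
value = critic(state)                         # V(s_t)
next_value = critic(next_state).detach()      # V(s_{t+1}), no gradient through the target
advantage = reward + gamma * next_value * (1 - done) - value   # TD error as advantage

actor_loss = -log_prob * advantage.detach()   # policy gradient weighted by the advantage
critic_loss = advantage.pow(2)                # regress V(s_t) toward the TD target
(actor_loss + critic_loss).backward()
```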
  • 56. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Asynchronous Advantage Actor-Critic (A3C) [337] • Both A3C and A2C use parallelism in training • A2C syncs up for a global parameter update and then starts each iteration with the same policy
  • 57. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 58. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep Deterministic Policy Gradient (DDPG) • Actor-Critic framework for learning a deterministic policy • Can be thought of as: DQN for continuous action spaces • As with all DQN, following tricks are required: • Experience Replay • Target network • Exploration: add noise to actions, reducing scale of the noise as training progresses [341]
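One common way to “add noise to actions, reducing scale as training progresses” is Gaussian noise with an exponentially decaying scale (the original DDPG paper used an Ornstein-Uhlenbeck process); a sketch, where `actor` is assumed to map a state to a continuous action vector:

```python
import numpy as np

# Deterministic action plus decaying Gaussian exploration noise (a sketch).
def noisy_action(actor, state, step, sigma_start=0.2, decay=1e-5, low=-1.0, high=1.0):
    a = np.asarray(actor(state))
    sigma = sigma_start * np.exp(-decay * step)   # shrink noise scale over training
    return np.clip(a + np.random.normal(0.0, sigma, size=a.shape), low, high)
```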
  • 59. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 60. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Policy Optimization • Progress beyond Vanilla Policy Gradient: • Natural Policy Gradient • TRPO • PPO • Basic idea in on-policy optimization: Avoid taking bad actions that collapse the training performance. [339] Line Search: First pick direction, then step size Trust Region: First pick step size, then direction
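PPO makes this idea concrete with a clipped surrogate objective that caps how far the updated policy may move from the one that collected the data; a sketch, assuming batch tensors of log-probabilities and advantages:

```python
import torch

# PPO clipped surrogate loss (a sketch; new_log_probs, old_log_probs,
# and advantages are assumed batch tensors for one rollout).
def ppo_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic minimum so large policy steps earn no extra credit
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```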
  • 61. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 62. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Game of Go [170]
  • 63. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo (2016) Beat Top Human at Go [83]
  • 64. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero (2017): Beats AlphaGo [149]
  • 65. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero Approach [170] • Same as the best before: Monte Carlo Tree Search (MCTS) • Balance exploitation/exploration (going deep on promising positions or exploring new underplayed positions) • Use a neural network as “intuition” for which positions to expand as part of MCTS (same as AlphaGo)
  • 66. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero Approach [170] • Same as the best before: Monte Carlo Tree Search (MCTS) • Balance exploitation/exploration (going deep on promising positions or exploring new underplayed positions) • Use a neural network as “intuition” for which positions to expand as part of MCTS (same as AlphaGo) • “Tricks” • Use MCTS intelligent look-ahead (instead of human games) to improve value estimates of play options • Multi-task learning: “two-headed” network that outputs (1) move probability and (2) probability of winning. • Updated architecture: use residual networks
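For reference, the exploitation/exploration balance in AlphaGo Zero's MCTS selects moves by a PUCT-style score, $Q(s,a) + c \, P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$, where $P$ is the network's move prior and $N$ a visit count; a minimal sketch:

```python
import math

# PUCT selection score for choosing which move to expand in MCTS (a sketch).
# q: mean value of the move, p: network prior, n: visit count of this move,
# n_total: total visits of the parent position, c: exploration constant.
def puct_score(q, p, n, n_total, c=1.5):
    return q + c * p * math.sqrt(n_total) / (1 + n)
```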
  • 67. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 68. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 69. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 70. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references To date, for most successful robots operating in the real world: Deep RL is not involved [169]
  • 71. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references To date, for most successful robots operating in the real world: Deep RL is not involved
  • 72. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references But… that’s slowly changing: Learning Control Dynamics [343]
  • 73. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references But… that’s slowly changing: Learning to Drive: Beyond Pure Imitation (Waymo) [343]
  • 74. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references The Challenge for RL in Real-World Applications Reminder: Supervised learning: teach by example Reinforcement learning: teach by experience Two options: 1. Real-world observation + one-shot trial & error 2. Realistic simulation + transfer learning Corresponding open challenges: 1. Improve transfer learning 2. Improve simulation
  • 75. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Thinking Outside the Box: Multiverse Theory and the Simulation Hypothesis • Create an (infinite) set of simulation environments to learn in so that our reality becomes just another sample from the set. [342]
  • 76. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Next Steps in Deep RL
  • Lectures: https://deeplearning.mit.edu
  • Tutorials: https://github.com/lexfridman/mit-deep-learning
  • Advice for Research (from Spinning Up as a Deep RL Researcher by Joshua Achiam) [339]
  • Background: fundamentals in probability, statistics, multivariate calculus; deep learning basics; deep RL basics; TensorFlow (or PyTorch)
  • Learn by doing: implement core deep RL algorithms (discussed today); look for the tricks and details in papers that were key to getting them to work; iterate fast in simple environments (see Einstein quote on simplicity)
  • Research: improve on an existing approach; focus on an unsolved task / benchmark; create a new task / problem that hasn’t been addressed with RL