1. Human-Level Control Through Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al.
Project Group: Qingyuan Feng, Jian Jin, Saad Mahboob, Rui Wang
2. Acknowledgements
► This presentation is partially adapted from presentations by:
► Dong-Kyoung Kye, available at: vi.snu.ac.kr/xe
► Jiang Guo, available at: http://ir.hit.edu.cn/~jguo/docs/notes/dqn-atari.pdf
3. Outline
► Motivation & Reinforcement Learning
► Model Description
– Q-Learning
– Q-network
– Training the Q-network
– Innovations of the model
► Project outline
4. Motivation
► Previously, game-playing agents were highly specific to a single game
► Goal: create an AI agent capable of playing a wide range of games, one step closer to Jarvis or R2D2
https://www.youtube.com/watch?v=cqXbjyWrdSo https://twitter.com/r2d2__starwars
5. Reinforcement Learning
► We want to teach the agent to play games
► Supervised learning would require expert players to supply on the order of 100,000 labeled games, which is impractical
► Reinforcement learning is the choice: the game score itself provides the training signal
9. Model Components
► States:
$s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, where $x_t$ is the vector of pixel values at time $t$
► Value function: discounted future reward
► Policy: $\pi$, a mapping from states to actions
► Goal: maximize the value function (discounted future reward)
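Concretely, the discounted future reward from time step $t$ is defined, as in the paper:

```latex
R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}
```

where $T$ is the time step at which the game terminates and $\gamma \in [0, 1]$ is the discount factor.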
10. Q-Learning
► Q function: maximum expected discounted future reward
$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right]$
► The Q function represents the "quality" of a certain action in a given state.
► Iterative calculation: the Bellman equation
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
► In practice, tabular value iteration is impractical
– The Q-value is specific to each sequence s and action a, so it can't generalize (see the sketch below)
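A minimal tabular sketch of the Bellman backup above; the table shape, learning rate `alpha`, and environment indices are illustrative, not from the slides:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])      # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])       # soft update toward the target
    return Q

# Usage with a hypothetical small discrete environment:
# Q = np.zeros((n_states, n_actions))
# Q = q_learning_update(Q, s, a, r, s_next)
```

The table holds one entry per (sequence, action) pair, which is exactly why it cannot generalize to raw pixel sequences and motivates the Q-network below.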
11. Q-Network
► Use a function approximator to estimate the action-value function
► A neural network with weights $\theta$ as the approximator, called the Q-network
► Input/Output: the network takes the state as input and outputs one Q-value per action (Q-value of action 1, action 2, action 3, ...), so all action values come from a single forward pass
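A sketch of such a Q-network in PyTorch, following the convolutional architecture reported in the Nature DQN paper (a stack of 4 preprocessed 84x84 frames in, one Q-value per action out); the class name and framework choice are mine:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one output head per action
        )

    def forward(self, state):                   # state: (batch, 4, 84, 84)
        return self.net(state)                  # -> (batch, n_actions)
```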
13. Training the Q-Network
► Loss function: mean squared error (MSE) between the prediction $Q(s, a; \theta)$ and the Q-learning target $y = r + \gamma \max_{a'} Q(s', a')$
► The derivatives of the loss w.r.t. the weights $\theta$ give the gradient
► Optimize using mini-batch SGD (see the sketch below)
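A minimal sketch of this loss and one SGD step, assuming a PyTorch Q-network as above; the batch layout (s, a, r, s_next, done) and the optimizer are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, batch, gamma=0.99):
    """MSE between Q(s, a) and the Q-learning target r + gamma * max_a' Q(s', a')."""
    s, a, r, s_next, done = batch               # tensors; `a` has dtype long
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q-value of the taken action
    with torch.no_grad():                       # treat the target as a fixed label
        target = r + gamma * q_net(s_next).max(1).values * (1.0 - done)
    return F.mse_loss(q_sa, target)

# One mini-batch SGD step; backward() computes the derivatives w.r.t. the weights:
# loss = td_loss(q_net, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```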
14. Innovations: Experience Replay
► Breaks temporal correlations between consecutive samples
► Better utilizes experience: each transition can be replayed many times
► Choose action $a_t$ according to an $\epsilon$-greedy policy
– Choose the best action with probability $1 - \epsilon$, a random action with probability $\epsilon$
► Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay memory D
► Sample a mini-batch of transitions $(s, a, r, s')$ from D
► Minimize the MSE between the Q-network and the Q-learning targets (see the sketch below)
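A sketch of the replay memory and the ε-greedy action choice; the capacity and the class/function names are illustrative:

```python
import random
from collections import deque
import torch

class ReplayMemory:
    """Fixed-capacity buffer D of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions fall off

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive frames.
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_net, state, n_actions, epsilon):
    """Best action with probability 1 - epsilon, a random one otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))
```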
15. Innovations: Separate Target Network
► A separate target network with the same structure as the training network
► Compute the Q-learning targets using less frequently updated parameters $\theta^-$ instead of the parameters $\theta$ of the training network
► Minimize the MSE between the Q-network and the Q-learning targets
► Periodically update $\theta^-$ to the values of $\theta$ (see the sketch below)
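A minimal sketch of maintaining θ⁻, assuming the PyTorch Q-network from earlier; the sync interval is a hyperparameter (a fixed number of steps in the paper):

```python
import copy

# A separate network with the same structure but frozen parameters θ⁻.
target_net = copy.deepcopy(q_net)

def sync_target(q_net, target_net):
    """Periodically copy the training parameters θ into θ⁻ (every C updates)."""
    target_net.load_state_dict(q_net.state_dict())
```

Holding the targets fixed between syncs keeps the regression target from chasing the network being trained, which stabilizes learning.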
21. How to Handle Continuous Control?
| Name | Range (units) | Description |
| --- | --- | --- |
| ob.angle | [−π, +π] | Angle between the car direction and the direction of the track axis |
| ob.track | (0, 200) (m) | Vector of 19 range-finder sensors: each sensor returns the distance between the track edge and the car, within a range of 200 meters |
| ob.trackPos | (−∞, +∞) | Distance between the car and the track axis, normalized w.r.t. the track width: 0 when the car is on the axis; values greater than 1 or less than −1 mean the car is outside the track |
| ob.speedX | (−∞, +∞) (km/h) | Speed of the car along its longitudinal axis (good velocity) |
| ob.speedY | (−∞, +∞) (km/h) | Speed of the car along its transverse axis |
| ob.speedZ | (−∞, +∞) (km/h) | Speed of the car along the Z-axis |
| ob.wheelSpinVel | (0, +∞) (rad/s) | Vector of 4 sensors representing the rotation speed of the wheels |
| ob.rpm | (0, +∞) (rpm) | Rotations per minute of the car engine |
22. How to Handle Continuous Control?
► DQN is designed for discrete outputs: one Q-value per action
► The continuous output space is high-dimensional: discretizing each control axis multiplies the number of actions (see the sketch below)
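One common way around this (not shown in the slides) is a deterministic actor that outputs the continuous controls directly, in the style of DDPG; the observation layout and layer sizes below are assumptions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Hypothetical policy head for TORCS-style continuous control:
    maps the observation vector to (steering, acceleration, brake)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.steer = nn.Linear(256, 1)          # bounded to [-1, 1] via tanh
        self.accel = nn.Linear(256, 1)          # bounded to [0, 1] via sigmoid
        self.brake = nn.Linear(256, 1)          # bounded to [0, 1] via sigmoid

    def forward(self, obs):
        h = self.body(obs)
        return torch.cat([torch.tanh(self.steer(h)),
                          torch.sigmoid(self.accel(h)),
                          torch.sigmoid(self.brake(h))], dim=-1)
```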
26. How to Choose Attention
► An attention model would reduce the dimensionality of the features obtained from images
► Convolutional local features may be enough for the decisions (see the sketch below)
► Supervised signals are correlated with the environment the agent exploits or explores
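A minimal sketch of soft attention over convolutional local features, as one way to realize the dimensionality reduction described above; the scoring function and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn

class SoftAttentionPool(nn.Module):
    """Scores each of L spatial locations in a (batch, L, C) feature map and
    returns the attention-weighted sum, reducing it to one C-dim vector."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, feats):                               # feats: (batch, L, C)
        weights = torch.softmax(self.score(feats), dim=1)   # (batch, L, 1)
        return (weights * feats).sum(dim=1)                 # (batch, C)
```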
27. Project Plan
► Utilize a CNN to extract temporal information
► Add a probabilistic mixture model layer on top of the convolutional features
► Develop two network architectures to process the temporal and convolutional features