Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.
NIPS Deep Learning Workshop 2013
Yu Kai Huang
Outline
● Reinforcement Learning
● Markov Decision Process
○ State, Action (Policy), Reward
○ Value function, Bellman Equation
● Optimal Policy
○ Bellman Optimality Equation
○ Q-learning
○ Deep Q-learning Network
● Experiments
○ Training and Stability
○ Evaluation
Reinforcement Learning
Reinforcement Learning
Image from https://arxiv.org/pdf/1312.5602.pdf
Reinforcement Learning
Image from https://i.imgur.com/kw5Veqz.jpg
Reinforcement Learning
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf
● No supervisor, only a reward signal.
● Feedback is delayed, not instantaneous.
● Time really matters.
● Agent’s actions affect the subsequent data it receives.
Reinforcement Learning
Image from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/1-1-A-RL/
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
● Action: a command that the agent can give in the game.
  ○ e.g. ↑, ↓, ←, →
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
● Action: a command that the agent can give in the game.
  ○ e.g. ↑, ↓, ←, →
● Reward: given after performing an action.
  ○ e.g. +1, -100
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● Full observability: the agent directly observes the environment state.
● Agent state = environment state = information state.
● Formally, this is a Markov Decision Process (MDP).
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Markov Decision Process
Markov Decision Process
● Markov decision processes formally describe an environment for reinforcement learning.
● The environment is fully observable.
● Almost all RL problems can be formalised as MDPs.
Markov Decision Process: State
● An MDP can be viewed as a directed graph whose nodes are states and whose edges describe transitions between states.
  ○ State Transition Matrix
● Markov Property: “The future is independent of the past given the present.”
  ○ The current state summarizes all past states.
  ○ e.g., if we only know the position of a ball but not its velocity, its state is no longer Markov.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
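In standard notation, the Markov property and the state transition matrix read:

```latex
% Markov property: the future is independent of the past given the present
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
% State transition matrix entry: probability of moving from state s to state s'
\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]
```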
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Markov Decision Process: Policy
● A policy fully defines the behaviour of an agent.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
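In the notation of the cited slides, a (stochastic) policy is a distribution over actions given states:

```latex
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
```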
Markov Decision Process: Reward and Return
● Each time you make a transition into a state, you receive a reward.
● Agents should learn to maximize cumulative future reward.
○ Return
○ Discount factor
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
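The return and discount factor referred to above are, in standard notation:

```latex
% Return: total discounted reward from time step t, with discount factor gamma in [0, 1]
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```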
Markov Decision Process: Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
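The value functions shown on this slide are expected returns, in standard notation:

```latex
% State-value function: expected return starting from state s
v(s) = \mathbb{E}[G_t \mid S_t = s]
% Action-value function: expected return after taking action a in state s
q(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]
```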
Markov Decision Process: Bellman Equation
● If we know the value of the next state, we can compute the value of the current state.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
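In equation form, the value decomposes into the immediate reward plus the discounted value of the successor state:

```latex
v(s) = \mathbb{E}[R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s]
q(s, a) = \mathbb{E}[R_{t+1} + \gamma \, q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]
```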
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Policy
Optimal Policy
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
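The optimal value functions are the maxima over all policies:

```latex
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)
```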
Bellman Optimality Equation for Q*
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
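The Bellman optimality equation for q* named in the slide title is, in standard notation:

```latex
q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')
```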
Solving the Bellman Optimality Equation
● The Bellman optimality equation is non-linear and has no closed-form solution in general; it is solved by iterative methods such as value iteration, policy iteration, Q-learning, and SARSA.
Example: How to be a good kid?
Image from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   0               0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   0               0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
  ○ R(s1, a1) = -5
● Q-table update:
  ○ delta = target(s1, a1) - q(s1, a1) = (-5 + 1*0) - 0 = -5
  ○ q(s1, a1) = q(s1, a1) + alpha*delta = 0 + 1*(-5) = -5
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
  ○ R(s1, a1) = -5
● Q-table update:
  ○ delta = target(s1, a1) - q(s1, a1) = (-5 + 1*0) - 0 = -5
  ○ q(s1, a1) = q(s1, a1) + alpha*delta = 0 + 1*(-5) = -5
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
  ○ R(s1, a2) = 1
● Q-table update:
  ○ delta = target(s1, a2) - q(s1, a2) = (1 + 1*0) - 0 = 1
  ○ q(s1, a2) = q(s1, a2) + alpha*delta = 0 + 1*1 = 1
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
  ○ R(s1, a2) = 1
● Q-table update:
  ○ delta = target(s1, a2) - q(s1, a2) = (1 + 1*0) - 0 = 1
  ○ q(s1, a2) = q(s1, a2) + alpha*delta = 0 + 1*1 = 1
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
  ○ R(s2, a1) = -5
● Q-table update:
  ○ delta = target(s2, a1) - q(s2, a1) = (-5 + 1*1) - 0 = -4
  ○ q(s2, a1) = q(s2, a1) + alpha*delta = 0 + 1*(-4) = -4
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -4              0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
  ○ R(s2, a1) = -5
● Q-table update:
  ○ delta = target(s2, a1) - q(s2, a1) = (-5 + 1*1) - 0 = -4
  ○ q(s2, a1) = q(s2, a1) + alpha*delta = 0 + 1*(-4) = -4
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
Select Action
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
Q-Learning
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
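Putting the walkthrough together, here is a minimal tabular Q-learning sketch in Python. The reward table and the gamma = alpha = 1 settings come from the slides; the deterministic transitions and the ε-greedy rule (standing in for the selection scheme pictured above) are assumptions for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2                 # states s1..s3; actions a1 (Watch TV), a2 (Do homework)
R = np.array([[-5.0, 1.0],                 # Reward-table from the slides
              [-5.0, 5.0],
              [ 0.0, 0.0]])
# Deterministic transitions (an assumption; the slides only show the visited states):
next_state = {(0, 0): 0, (0, 1): 1,        # s1: watch TV -> s1, do homework -> s2
              (1, 0): 0, (1, 1): 2,        # s2: watch TV -> s1, do homework -> s3
              (2, 0): 2, (2, 1): 2}        # s3 absorbs

gamma, alpha, epsilon = 1.0, 1.0, 0.1      # gamma and alpha as set on the slides
Q = np.zeros((n_states, n_actions))        # Q-table, initialized to all zeros

rng = np.random.default_rng(0)
s = 0                                      # start in s1
for step in range(100):
    # epsilon-greedy selection: explore with probability epsilon, else act greedily
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = next_state[(s, a)]
    target = R[s, a] + gamma * Q[s_next].max()   # target(s, a) = R + gamma * max_a' q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])        # q(s, a) <- q(s, a) + alpha * delta
    s = 0 if s_next == 2 else s_next             # restart the episode once s3 is reached

print(np.round(Q, 2))
```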
Deep Q-learning Network
● Data Preprocessing: “The raw frames are preprocessed by first converting
their RGB representation to gray-scale and down-sampling it to a 110×84
[...] cropping an 84 × 84 region of the image [...].”
● Model Architecture
○ Input size: 84x84x4
○ Output size: 4 (←, →, x, B)
○ Layers (see the sketch below):
■ conv1(16, (8, 8), strides=(4, 4))
■ conv2(32, (4, 4), strides=(2, 2))
■ Dense(256)
■ Dense(4)
Image from https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
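A sketch of the preprocessing and the network in Keras, matching the bullets above. The 110×84 down-sample, 84×84 crop, and layer sizes follow the quoted paper text; the exact crop offset is an assumption (the paper only says the crop roughly captures the playing area), and ReLU hidden activations follow the paper's rectifier nonlinearity.

```python
import cv2
import numpy as np
from tensorflow.keras import layers, models

def preprocess(frame_rgb):
    """Grayscale -> down-sample to 110x84 -> crop an 84x84 playing-area region."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 110))    # cv2.resize takes (width, height)
    return small[18:102, :]                # crop offset is an assumption

def build_q_network(n_actions=4):
    """Q-network from the slide: two conv layers followed by two dense layers."""
    return models.Sequential([
        layers.Input(shape=(84, 84, 4)),   # 4 stacked preprocessed frames
        layers.Conv2D(16, (8, 8), strides=(4, 4), activation="relu"),
        layers.Conv2D(32, (4, 4), strides=(2, 2), activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_actions),           # one Q-value per action
    ])

model = build_q_network()
model.summary()
```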
Deep Q-learning Network
● Experience Replay
○ “we store the agent’s experiences at each time-step, et = (st, at, rt, st+1)
in a data-set D = e1, ..., eN , pooled over many episodes into a replay
memory.”
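A minimal sketch of the quoted replay memory; the `done` flag and the eviction-by-deque behavior are implementation details not specified in the slide.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D of capacity N."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniformly sampled minibatch, pooled over many episodes
        return random.sample(self.buffer, batch_size)
```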
Deep Q-learning Network
● Optimal action-value function
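Concretely, the paper trains the network by minimizing a sequence of squared-error losses whose targets come from the Bellman optimality equation:

```latex
% Bellman optimality for the optimal action-value function
Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a \,\big]
% DQN loss at iteration i, with targets computed from the previous parameters
y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}), \qquad
L_i(\theta_i) = \mathbb{E}\big[ (y_i - Q(s, a; \theta_i))^2 \big]
```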
Deep Q-learning Network
Image from https://arxiv.org/pdf/1312.5602.pdf
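For concreteness, one gradient step of the algorithm shown above might look like the following sketch, reusing `model` and `ReplayMemory` from earlier. Minibatch size 32 and RMSProp follow the paper; the learning rate, gamma = 0.99, and pixel scaling are assumptions.

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=2.5e-4)  # learning rate is an assumption
gamma = 0.99                                                   # discount factor (assumption)

def train_step(model, memory, batch_size=32):
    # Sample a uniform minibatch of transitions (s, a, r, s', done)
    batch = memory.sample(batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))
    s = s.astype(np.float32) / 255.0          # pixel scaling is an assumption
    s_next = s_next.astype(np.float32) / 255.0

    # y_j = r_j for terminal s', else r_j + gamma * max_a' Q(s', a'; theta)
    q_next = model(s_next).numpy().max(axis=1)
    y = r + gamma * q_next * (1.0 - done.astype(np.float32))

    with tf.GradientTape() as tape:
        q = model(s)                                      # Q(s, .; theta)
        q_a = tf.reduce_sum(q * tf.one_hot(a, q.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_a))         # (y_j - Q(s_j, a_j; theta))^2
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```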
Experiments
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Visualizing the Value Function
Image from https://arxiv.org/pdf/1312.5602.pdf
Main Evaluation
● A trial: 5,000 training episodes, followed by 500 evaluation episodes.
● Average performance across 30 trials.
Image from https://arxiv.org/pdf/1312.5602.pdf
Ref.
[1] Jaromír Janisch: Let's Make a DQN: Theory. https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/#fn-38-6
[2] Venelin Valkov: Solving an MDP with Q-Learning from scratch — Deep Reinforcement Learning for Hackers (Part 1). https://medium.com/@curiousily/solving-an-mdp-with-q-learning-from-scratch-deep-reinforcement-learning-for-hackers-part-1-45d1d360c120
[3] Flood Sung: Deep Reinforcement Learning Fundamentals (DQN). https://blog.csdn.net/songrotek/article/details/50580904
[4] Flood Sung: A Review of Classic Reinforcement Learning Algorithms 1: Policy and Value Iteration. https://blog.csdn.net/songrotek/article/details/51378582
[5] mmc2015: Reinforcement Learning: Policy Evaluation, Policy Iteration, Value Iteration, Dynamic Programming. https://blog.csdn.net/mmc2015/article/details/52859611
Ref.
[6] Gai's Blog: Reinforcement Learning Part 1 - Introduction. https://bluesmilery.github.io/blogs/481fe3af/
[7] Gai's Blog: Reinforcement Learning Part 2 - Markov Decision Process. https://bluesmilery.github.io/blogs/e4dc3fbf/
[8] Gai's Blog: Reinforcement Learning Part 3 - Planning by Dynamic Programming. https://bluesmilery.github.io/blogs/b96003ba/
[9] Rowan McAllister: Introduction to Reinforcement Learning. http://mlg.eng.cam.ac.uk/rowan/files/rl/01_mdps.pdf
[10] Adrien Lucas Ecoffet: Beat Atari with Deep Reinforcement Learning! (Part 0: Intro to RL). https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
[11] Josh Greaves: Everything You Need to Know to Get Started in Reinforcement Learning. https://joshgreaves.com/reinforcement-learning/introduction-to-reinforcement-learning/
[12] Katerina Fragkiadaki: Deep Reinforcement Learning and Control: Markov Decision Processes. https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Ref.
[13] 莫烦 (Morvan Zhou): What is Q-Learning? https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
