Playing Atari with Deep
Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.
NIPS Deep Learning Workshop 2013
Yu Kai Huang
Outline
● Reinforcement Learning
● Markov Decision Process
○ State, Action (Policy), Reward
○ Value function, Bellman Equation
● Optimal Policy
○ Bellman Optimality Equation
○ Q-learning
○ Deep Q-learning Network
● Experiments
○ Training and Stability
○ Evaluation
Reinforcement Learning
Reinforcement Learning
Image from https://arxiv.org/pdf/1312.5602.pdf
Reinforcement Learning
Image from https://i.imgur.com/kw5Veqz.jpg
Reinforcement Learning
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf
● No supervisor, only a reward
signal.
● Feedback is delayed, not
instantaneous.
● Time really matters.
● Agent’s actions affect the
subsequent data it receives.
Reinforcement Learning
Image from
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/1-1-A-RL/
Reinforcement Learning
● State: the current situation that
the agent is in.
○ e.g. moving (position,
velocity, acceleration,...)
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that
the agent is in.
○ e.g. moving (position,
velocity, acceleration,...)
● Action: a command that the agent can give in the game.
○ e.g. ↑, ↓, ←, →
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that
the agent is in.
○ e.g. moving (position,
velocity, acceleration,...)
● Action: a command that the agent can give in the game.
○ e.g. ↑, ↓, ←, →
● Reward: given after performing
an action.
○ e.g. +1, -100
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● Full observability: agent directly
observes environment state.
● Agent state = environment state
= information state
● Formally, this is a Markov
Decision Process (MDP).
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Markov Decision Process
Markov Decision Process
● Markov decision processes formally describe an environment for
reinforcement learning.
● Where the environment is fully observable.
● Almost all RL problems can be formalised as MDPs.
Markov Decision Process: State
● An MDP can be drawn as a directed graph whose nodes are states and whose edges describe transitions between Markov states.
○ State Transition Matrix (written out below)
● Markov Property: “The future is independent of the past given the present”
○ The current state summarizes all past states.
○ e.g., if we only know the position of the ball but not its velocity, its state is
no longer Markov.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
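Written out for reference (the slide shows these as an image; this is the standard notation from the linked Silver slides): the Markov property and one entry of the state transition matrix are

\Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \dots, S_t]
\mathcal{P}_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]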
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Markov Decision Process: Policy
● A policy fully defines the behaviour of an agent.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Markov Decision Process: Reward and Return
● Each time you make a transition into a state, you receive a reward.
● Agents should learn to maximize cumulative future reward.
○ Return
○ Discount factor
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
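For reference, the return and the discount factor (standard definitions, presumably what this slide's image showed):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1]

A γ near 0 weights immediate rewards; a γ near 1 weights long-term rewards.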
Markov Decision Process: Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
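The definitions this slide's image presumably showed (standard notation): both value functions are expected returns.

v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]
q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]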
Markov Decision Process: Bellman Equation
● If we know the value of the next state, we can compute the value of the current state.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
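In symbols (reconstructing the slide's image, notation as in the linked Silver slides):

v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s]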
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Policy
Optimal Policy
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
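In symbols (a standard result from the linked slides): an optimal deterministic policy acts greedily with respect to q^*:

\pi^*(a \mid s) = 1 \text{ if } a = \arg\max_{a'} q^*(s, a'), \text{ and } 0 \text{ otherwise}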
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
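Written out (standard definitions, reconstructing the slide's image):

v^*(s) = \max_{\pi} v_{\pi}(s), \qquad q^*(s, a) = \max_{\pi} q_{\pi}(s, a)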
Bellman Optimality Equation for Q*
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
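The equation this slide shows, written out (standard form from the linked Silver slides):

q^*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q^*(s', a')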
Solving the Bellman Optimality Equation
Example: How to be a good kid?
Image from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 0 0
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a1 = argmax(q(s1, a1), q(s1, a2))
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 0 0
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
○ R(s1, a1) = -5
● Q-table:
○ delta = target(s1, a1) - q(s1, a1)
= (-5+1*(0)) - 0 = -5
○ q(s1, a1) = q(s1, a1) + alpha*delta
= 0 + 1*(-5) = -5
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 0
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
○ R(s1, a1) = -5
● Q-table:
○ delta = target(s1, a1) - q(s1, a1)
= (-5+1*(0)) - 0 = -5
○ q(s1, a1) = q(s1, a1) + alpha*delta
= 0 + 1*(-5) = -5
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 0
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a2 = argmax(q(s1, a1), q(s1, a2))
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 0
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
○ R(s1, a2) = 1
● Q-table:
○ delta = target(s1, a2) - q(s1, a2)
= (1+1*(0)) - 0 = 1
○ q(s1, a2) = q(s1, a2) + alpha*delta
= 0 + 1*1 = 1
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 0 0
s3 0 0
● Current state: s1
● Select Action:
○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
○ R(s1, a2) = 1
● Q-table:
○ delta = target(s1, a2) - q(s1, a2)
= (1+1*(0)) - 0 = 1
○ q(s1, a2) = q(s1, a2) + alpha*delta
= 0 + 1*1 = 1
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 0 0
s3 0 0
● Current state: s2
● Select Action:
○ a1 = argmax(q(s2, a1), q(s2, a2))
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 0 0
s3 0 0
● Current state: s2
● Select Action:
○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
○ R(s2, a1) = -5
● Q-table:
○ delta = target(s2, a1) - q(s2, a1)
= (-5+1*1) - 0 = -4
○ q(s2, a1) = q(s2, a1) + alpha*delta
= 0 + 1*(-4) = -4
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
Q-Learning
● Q-table
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -4 0
s3 0 0
● Current state: s2
● Select Action:
○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
○ R(s2, a1) = -5
● Q-table:
○ delta = target(s2, a1) - q(s2, a1)
= (-5+1*1) - 0 = -4
○ q(s2, a1) = q(s2, a1) + alpha*delta
= 0 + 1*(-4) = -4
a1(Watch TV) a2(Do homework)
s1 -5 1
s2 -5 5
s3 0 0
● Reward-table
Set discount factor γ = 1, learning rate α = 1
target(s, a) = R(s, a) + γ·max_a′ q(s′, a′)
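The whole walkthrough above fits in a few lines of Python. A minimal sketch: the reward table and the γ = 1, α = 1 settings are copied from the slides, but the deterministic transition function T is an assumption inferred from the example (s1 →a1 s1, s1 →a2 s2, s2 →a1 s1); the slides never state it in full.

import numpy as np

n_states, n_actions = 3, 2
R = np.array([[-5., 1.],    # s1: a1 (Watch TV), a2 (Do homework)
              [-5., 5.],    # s2
              [ 0., 0.]])   # s3
T = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 2}  # assumed transitions
gamma, alpha = 1.0, 1.0
Q = np.zeros((n_states, n_actions))

def q_update(s, a):
    # target(s, a) = R(s, a) + gamma * max_a' Q(s', a')
    s_next = T[(s, a)]
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return s_next

q_update(0, 0)  # Q[s1, a1] -> -5, as in the first update above
q_update(0, 1)  # Q[s1, a2] ->  1
q_update(1, 0)  # target = -5 + max(-5, 1) = -4, so Q[s2, a1] -> -4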
Q-Learning
Select Action
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
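The linked post, like most tabular Q-learning tutorials, selects actions ε-greedily: explore at random with probability ε, otherwise act greedily. A minimal sketch of that rule (the value of ε is illustrative, not from the slides):

import numpy as np

def select_action(Q, s, epsilon=0.1, rng=np.random.default_rng(0)):
    # Explore with probability epsilon; otherwise exploit argmax_a Q[s, a].
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))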
Q-Learning
target(s, a)
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
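Combining the two slides above, one full tabular Q-learning update (the rule applied step by step in the worked example) is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]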
Deep Q-learning Network
● Data Preprocessing: “The raw frames are preprocessed by first converting
their RGB representation to gray-scale and down-sampling it to a 110×84
[...] cropping an 84 × 84 region of the image [...].”
● Model Architecture
○ Input size: 84×84×4
○ Output size: 4 (←, →, x, B)
○ layers:
■ conv1(16, (8, 8), strides=(4, 4))
■ conv2(32, (4, 4), strides=(2, 2))
■ Dense(256)
■ Dense(4)
Image from https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
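A sketch of the network listed above in tf.keras (an assumption: the slides name no framework; the ReLU activations and the linear output layer follow the paper):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(84, 84, 4)),                                # 4 stacked 84×84 frames
    layers.Conv2D(16, (8, 8), strides=(4, 4), activation="relu"),   # conv1
    layers.Conv2D(32, (4, 4), strides=(2, 2), activation="relu"),   # conv2
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(4),  # one linear Q-value per action (←, →, x, B)
])

Outputting all action values in one forward pass, rather than feeding (state, action) pairs in, is the paper's design choice: a single pass scores every action.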
Deep Q-learning Network
● Experience Replay
○ “we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory.”
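A minimal replay memory matching that description: a fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}) tuples, sampled uniformly at random. The capacity, batch size, and the added terminal flag are this sketch's assumptions. Sampling uniformly from old experience breaks the correlation between consecutive frames, which is the point of the mechanism.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off the front

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)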
Deep Q-learning Network
● Optimal action-value function
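As written in the paper, the optimal action-value function and the Bellman equation it obeys are:

Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]
Q^*(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]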
Deep Q-learning Network
Image from https://arxiv.org/pdf/1312.5602.pdf
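One gradient step of the algorithm shown here, sketched using the ReplayMemory and model from the previous slides (both this write-up's assumptions, not code from the slides; γ = 0.99 and the batch size are illustrative). As in the paper, the target y = r + γ·max_a′ Q(s′, a′) is held fixed while differentiating:

import numpy as np
import tensorflow as tf

def train_step(model, memory, optimizer, gamma=0.99, batch_size=32):
    s, a, r, s_next, done = map(np.array, zip(*memory.sample(batch_size)))
    s, s_next = s.astype(np.float32), s_next.astype(np.float32)
    # Targets: y = r for terminal transitions, else r + gamma * max_a' Q(s', a').
    q_next = model(s_next).numpy().max(axis=1)
    y = (r + gamma * q_next * (1.0 - done)).astype(np.float32)
    with tf.GradientTape() as tape:
        q_sa = tf.reduce_sum(model(s) * tf.one_hot(a, 4), axis=1)  # Q(s, a) taken
        loss = tf.reduce_mean(tf.square(y - q_sa))                 # (y - Q(s, a))^2
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)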
Experiments
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Visualizing the Value Function
Image from https://arxiv.org/pdf/1312.5602.pdf
Main Evaluation
● A trial: 5,000 training episodes, followed by 500 evaluation episodes.
● Average performance across 30 trials.
Image from https://arxiv.org/pdf/1312.5602.pdf
Ref.
[1] Jaromír Janisch: Let's Make a DQN: Theory https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/#fn-38-6
[2] Venelin Valkov: Solving an MDP with Q-Learning from scratch — Deep Reinforcement Learning for Hackers (Part 1)
https://medium.com/@curiousily/solving-an-mdp-with-q-learning-from-scratch-deep-reinforcement-learning-for-hackers-part-1-45d1d360c120
[3] Flood Sung: Deep Reinforcement Learning Fundamentals (DQN) https://blog.csdn.net/songrotek/article/details/50580904
[4] Flood Sung: A Review of Classic Reinforcement Learning Algorithms, 1: Policy and Value Iteration
https://blog.csdn.net/songrotek/article/details/51378582
[5] mmc2015: Reinforcement Learning: Policy Evaluation, Policy Iteration, Value Iteration, Dynamic Programming
https://blog.csdn.net/mmc2015/article/details/52859611
Ref.
[6] Gai's Blog: Reinforcement Learning, Part 1 - Introduction https://bluesmilery.github.io/blogs/481fe3af/
[7] Gai's Blog: Reinforcement Learning, Part 2 - Markov Decision Process
https://bluesmilery.github.io/blogs/e4dc3fbf/
[8] Gai's Blog: Reinforcement Learning, Part 3 - Planning by Dynamic Programming
https://bluesmilery.github.io/blogs/b96003ba/
[9] Rowan McAllister: Introduction to Reinforcement Learning http://mlg.eng.cam.ac.uk/rowan/files/rl/01_mdps.pdf
[10] Adrien Lucas Ecoffet: Beat Atari with Deep Reinforcement Learning! (Part 0: Intro to RL)
https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
[11] Joshgreaves: Everything You Need to Know to Get Started in Reinforcement Learning
https://joshgreaves.com/reinforcement-learning/introduction-to-reinforcement-learning/
[12] Katerina Fragkiadaki: Deep Reinforcement Learning and Control: Markov Decision Processes
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Ref.
[13] 莫烦 (Morvan): What is Q-Learning
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
