Deep Q-Network for beginners
Etsuji Nakai
Cloud Solutions Architect at Google
2016/08/01 ver1.0
$ who am i
▪ Etsuji Nakai
Cloud Solutions Architect at Google
Twitter @enakai00
Typical application of DQN
▪ Learning “the optimal operations to achieve the best score” through the screen
images of video games.
– In theory, you can learn this (without knowing the rules of the game) by collecting
data consisting of “on what screen, with which operation, how the score will
change, and what the next screen will be.”
– This is analogous to constructing an algorithm for the game of Go by collecting
data consisting of “on what board configuration, where you put the next stone, and how
your advantage will change.”
https://www.youtube.com/watch?v=r3pb-ZDEKVg
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Theoretical framework of DQN
▪ Suppose that you have all the quartets (s, a, r, s') for every pair (s, a), meaning “with
the current state s and the action a, you will receive a reward (score) r and the next
state will be s'.”
– This corresponds to the data “on what screen, with which operation, how the score will
change, and what the next screen will be.”
– It is impractical to collect data for all possible pairs (s, a), but suppose that you have
enough of them to train the model to a certain level.
– Note that, mathematically speaking, r and s' are functions of the pair (s, a).
▪ You might naively think that the following rule gives the optimal action for the current
state s:
 ⇒ Choose the action a which maximizes the immediate reward r.
– But this doesn't necessarily lead to the best outcome. In the case of Breakout, you are
better off hitting blocks near the side walls even though it may take a little longer.
▪ In a nutshell, you have to figure out the action a which maximizes the long-term
rewards.
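
To make the data and the naive rule concrete, here is a minimal Python sketch (not from the original slides; Transition, actions, and reward_fn are hypothetical names):

    from collections import namedtuple

    # One observed quartet: state, action, reward, next state.
    Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

    def naive_action(s, actions, reward_fn):
        # The short-sighted rule: maximize only the immediate reward r(s, a).
        # As noted above, this is generally not the best long-term strategy.
        return max(actions, key=lambda a: reward_fn(s, a))
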
Let’s imagine the magical “Q” function
▪ First, we define the total rewards as below:

    R = Σ_{n=0}^{∞} γ^n r(s_n, a_n)

– s_n and a_n represent the state and action at the n-th step. γ is a small number
around 0.9, introduced to keep the sum from becoming infinite.
▪ Now suppose that we have a convenient magical function Q(s, a) as below, although
we don't know how to calculate it at all.
– Q(s, a) = “the total rewards you will receive when you choose the next action a,
and keep choosing the optimal actions afterwards.”
▪ Once you have the function Q(s, a), you can choose the optimal action at state s
with the following rule:

    a* = argmax_a Q(s, a)

 ⇒ Choose the action a which maximizes the total rewards, given that you keep choosing the optimal
actions afterwards.
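
As a one-function sketch of this rule (Q and actions are hypothetical, as above), contrast it with naive_action: it maximizes the long-term value rather than the immediate reward:

    def optimal_action(Q, s, actions):
        # Greedy with respect to Q: maximizes the total (discounted) rewards,
        # not just the immediate reward r(s, a).
        return max(actions, key=lambda a: Q(s, a))
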
The black magic of “recursive definition”
▪ Although we are not sure how we could calculate Q(s, a), we can say that it satisfies
the following “Q-equation”, where s' is the next state determined by (s, a):

    Q(s, a) = r(s, a) + γ max_{a'} Q(s', a')

– See the next slide for the mathematical proof.
Proof of the Q-equation
– Suppose that the chain of states and actions is as below when you start from the state s_0
and keep choosing the best actions:

    s_0 → s_1 → s_2 → …  (taking the actions a_0, a_1, a_2, …)

– From the definition of Q(s, a), the following equation holds:

    Q(s_0, a_0) = r(s_0, a_0) + γ r(s_1, a_1) + γ^2 r(s_2, a_2) + …  ―― (1)

– Now suppose that the initial state is s_1 instead of s_0. If you keep choosing the best
actions, the chain will be as below:

    s_1 → s_2 → s_3 → …  (taking the actions a_1, a_2, a_3, …)

– Again, from the definition of Q(s, a), the following equation holds:

    Q(s_1, a_1) = r(s_1, a_1) + γ r(s_2, a_2) + γ^2 r(s_3, a_3) + …  ―― (2)

– Rearrange (1) as below, and substitute (2):

    Q(s_0, a_0) = r(s_0, a_0) + γ { r(s_1, a_1) + γ r(s_2, a_2) + … }
                = r(s_0, a_0) + γ Q(s_1, a_1)

– Considering the following relations, this is equivalent to the Q-equation:

    s_1 = s'(s_0, a_0)  and  a_1 = argmax_{a'} Q(s_1, a'),
    hence Q(s_1, a_1) = max_{a'} Q(s', a')
Approximate the “Q” function using the Q-equation
▪ Prepare some function with adjustable parameters; by adjusting them, you may
find a function which satisfies the Q-equation.
– If you succeed, you now have the “Q” function!
– Strictly speaking, the Q-equation is not a sufficient condition but a necessary one. However,
under some assumptions, it has been proved to be sufficient as well.
▪ Here are the steps to adjust the parameters (a code sketch follows the list).
– 1. Let D be the set of all the quartets (s, a, r, s') you have.
– 2. Prepare an initial candidate for the “Q” function as Q(s, a | w).
– 3. Calculate the following error function E(w) using all (or a part) of the data in D. This is the
sum of squared differences between the LHS and RHS of the Q-equation:

    E(w) = Σ_{(s, a, r, s') ∈ D} { Q(s, a | w) − ( r + γ max_{a'} Q(s', a' | w) ) }^2

– 4. Adjust the parameters w so that E(w) becomes smaller. Then go back to 3.
▪ After repeating 3 and 4, if E(w) becomes small enough, you have an approximate
version of the “Q” function.
– You should use a more complicated candidate Q(s, a | w) to obtain a better
approximation.
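
The following Python sketch illustrates steps 1–4 under stated assumptions: Q is a hypothetical callable Q(s, a, w), w is a NumPy parameter vector, D is a list of quartets, actions is the list of possible actions, and a numerical gradient stands in for the backpropagation a real implementation would use:

    import numpy as np

    GAMMA = 0.9  # the discount factor γ, the "small number around 0.9" above

    def q_error(Q, w, D, actions):
        # E(w): sum of squared differences between the two sides of the
        # Q-equation over the quartets (s, a, r, s_next) in D.
        total = 0.0
        for s, a, r, s_next in D:
            target = r + GAMMA * max(Q(s_next, b, w) for b in actions)
            total += (Q(s, a, w) - target) ** 2
        return total

    def adjust(Q, w, D, actions, steps=100, lr=1e-3, eps=1e-5):
        # Steps 3-4: repeatedly nudge each parameter w[i] downhill on E(w),
        # using a central-difference gradient to keep the sketch self-contained.
        for _ in range(steps):
            grad = np.zeros_like(w)
            for i in range(len(w)):
                d = np.zeros_like(w)
                d[i] = eps
                grad[i] = (q_error(Q, w + d, D, actions) -
                           q_error(Q, w - d, D, actions)) / (2 * eps)
            w = w - lr * grad
        return w

For stability, production DQN implementations also evaluate the target term with a periodically frozen copy of w (a "target network"), a detail beyond this deck.
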
What is “the more complicated function”?
▪ Yes!
The Deep Neural Network!
▪ A “Deep Q-Network” is essentially a multi-layer neural network used as the candidate
for the Q-function.
By the way, what is Neural Network?
▪ Roughly speaking, it's just a combination of multiple simple functions that results in a
highly complex function.
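
A minimal NumPy sketch of that idea (the layer structure and names are illustrative, not from the slides):

    import numpy as np

    def simple_fn(x, W, b):
        # One simple building block: an affine map followed by a ReLU.
        return np.maximum(0.0, W @ x + b)

    def network(x, params):
        # Composing the simple blocks yields a highly complex function;
        # in DQN, this composite plays the role of the candidate Q(s, a | w).
        *hidden, (W_out, b_out) = params
        for W, b in hidden:
            x = simple_fn(x, W, b)
        return W_out @ x + b_out  # linear output layer: one value per action
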
How do you collect the quartets?
▪ How do you collect the quartets (s, a, r, s') in the real world?
– Basically, just keep playing the game with random actions.
– In theory, if you keep playing for an infinite time, you will encounter all the possible states.
▪ But in reality, the probability of reaching some states with random actions is quite
small. To compensate for this, you can take the following strategy (sketched in code below).
– Once you have collected some amount of data, train the Q-function using these data.
– After that, play the game by mixing random actions with the (presumably) best actions
calculated from the current Q-function.
– When you have collected some more data, train the Q-function again.
– Through this cycle, you make the Q-function better and collect more data, including states
which are unreachable with random actions alone.
▪ Why not play using only the Q-function, without random actions?
– It doesn't work. By visiting all kinds of states, even via random actions, the
model can learn “how to gain long-term rewards by passing through non-rewarding
states.”
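
A short sketch of this collection cycle (commonly called ε-greedy; step is a hypothetical environment function returning the reward and next state):

    import random

    def mixed_action(Q, s, actions, epsilon=0.1):
        # With probability epsilon take a random action; otherwise take the
        # (presumably) best action from the current Q-function.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q(s, a))

    def collect_quartets(step, Q, s, actions, n):
        # Play n moves, recording each quartet (s, a, r, s') into D.
        D = []
        for _ in range(n):
            a = mixed_action(Q, s, actions)
            r, s_next = step(s, a)  # hypothetical environment interface
            D.append((s, a, r, s_next))
            s = s_next
        return D
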
Thank you!
