© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Principal Solutions Architect – AWS Deep Learning
Amazon Web Services
Game Playing RL Agent
Inspiration From Nature
https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
Perceptron with inputs I1, I2, bias input B, weights w1, w2, w3, and output O:

f(x_i, w_i) = Φ(b + Σ_i (w_i · x_i))

Φ(x) = 1, if x ≥ 0.5
Φ(x) = 0, if x < 0.5

Truth table for P ∧ Q:

P Q | P ∧ Q
T T | T
T F | F
F T | F
F F | F
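The threshold unit above can be sketched in a few lines of Python; the weights and bias here are hand-picked (an assumption, not learned) so that the unit computes P ∧ Q:

```python
# Perceptron sketch: f(x, w) = Phi(b + sum_i w_i * x_i), with a step
# activation Phi that fires when the weighted sum reaches 0.5.
def phi(x):
    return 1 if x >= 0.5 else 0

def perceptron(inputs, weights, bias):
    return phi(bias + sum(w * x for w, x in zip(weights, inputs)))

# Hand-picked weights/bias so the unit fires only when both inputs are 1.
W, B = [0.5, 0.5], -0.4

# prints the AND truth table
for p in (0, 1):
    for q in (0, 1):
        print(p, q, perceptron([p, q], W, B))
```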
Process of Learning
Process of Learning
Agent–environment interaction loop (Sutton and Barto): at each time step t, the agent receives state S_t and reward R_t from the environment and emits action A_t; the environment responds with the next state S_{t+1} and reward R_{t+1}.
Markov State
An information state (a.k.a. Markov state) contains all
useful information from the history.
A state S_t is Markov if and only if:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
Expected Return
• Expected return G_t: the sum of future rewards, potentially discounted by a factor γ, where γ ∈ [0, 1]:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
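The discounted sum above is easy to compute for a finite reward sequence; the rewards below are toy values for illustration:

```python
# Discounted return sketch: G_t = sum_k gamma^k * R_{t+k+1}
# for a finite list of future rewards.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three rewards of 1 discounted by gamma = 0.9: 1 + 0.9 + 0.81
g = discounted_return([1, 1, 1], gamma=0.9)
```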
Bellman Expectation Equations
v_π(s) = E_π[G_t | S_t = s] = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The value of s is the expected return at state s, following policy π subsequently.

This is the Bellman Expectation Equation. It can also be expressed as the action-value function for policy π:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
          = R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s')

The value of taking action a at state s under policy π. (Backup diagram: root s → v_π(s); pair (s, a) → q_π(s, a); successor s' → v_π(s').)
Bellman Equations - Example
From the first transition diagram (probabilities .5, .25, .25 onto successor values 10, 5, −3):

v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5

From the second diagram, with an intermediate branch (R = 5 with P = .5; R = 2 with P = .5, whose successors have probabilities .4, .5, .1; the value 4.4 is used self-consistently):

v(s) = 5 × .5 + .5 × [.4 × 2 + .5 × 5 + .1 × 4.4] = 4.4
Optimal Policy
From the diagram, three actions with immediate rewards R = −1, R = 2, R = 3 lead to successor states with values 10, 5, and −3:

v*(s) = max{−1 + 10, 2 + 5, 3 − 3} = 9

A policy π is better than π′ if v_π(s) ≥ v_π′(s) ∀ s ∈ S

v*(s) ≡ max_π v_π(s) ∀ s ∈ S
Iterative Policy Evaluation
4×4 gridworld with nonterminal states 1–14 and terminal states in two corners; every transition earns reward R_t = −1; the uniform random policy takes each action with equal probability:

π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
Iterative Policy Evaluation
k = 0: v(s) = 0.00 for all nonterminal states 1–14; the terminal states are fixed at 0.

π: random policy, π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25; R_t = −1.
Iterative Policy Evaluation
k = 0 → k = 1 (π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25; R_t = −1):

v_{k+1}(1) = .25 × (−1 + v_k(2)) [→]
           + .25 × (−1 + v_k(1)) [↑ stays in 1]
           + .25 × (−1 + v_k(5)) [↓]
           + .25 × (−1 + v_k(T)) [← terminal]
           = −.25 − .25 − .25 − .25 = −1

v_{k+1}(7) = .25 × (−1 + v_k(7)) [→ stays in 7]
           + .25 × (−1 + v_k(3)) [↑]
           + .25 × (−1 + v_k(11)) [↓]
           + .25 × (−1 + v_k(6)) [←]
           = −.25 − .25 − .25 − .25 = −1

k = 1 grid: −1.00 at every nonterminal state, 0 at the terminals.
Iterative Policy Evaluation
k = 1 → k = 2 (π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25; R_t = −1):

v_{k+1}(1) = .25 × (−1 + (−1.00)) [→ 2]
           + .25 × (−1 + (−1.00)) [↑ stays in 1]
           + .25 × (−1 + (−1.00)) [↓ 5]
           + .25 × (−1 + 0) [← terminal]
           = .25 × (−2 − 2 − 2 − 1) = −1.75

v_{k+1}(7) = .25 × (−1 + (−1.00)) for each of →, ↑, ↓, ←
           = .25 × (−2 − 2 − 2 − 2) = −2

k = 2 grid (terminals in corners):

  0    −1.75 −2.00 −2.00
−1.75 −2.00 −2.00 −2.00
−2.00 −2.00 −2.00 −1.75
−2.00 −2.00 −1.75   0
Iterative Policy Evaluation
k = 2 → k = 3 (π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25; R_t = −1):

v_{k+1}(1) = .25 × (−1 + (−2.00)) [→ 2]
           + .25 × (−1 + (−1.75)) [↑ stays in 1]
           + .25 × (−1 + (−2.00)) [↓ 5]
           + .25 × (−1 + 0) [← terminal]
           = .25 × (−3 − 2.75 − 3 − 1) = −2.4375 (shown as −2.43)

v_{k+1}(7) = .25 × (−1 + (−2.00)) [→ stays in 7]
           + .25 × (−1 + (−2.00)) [↑ 3]
           + .25 × (−1 + (−1.75)) [↓ 11]
           + .25 × (−1 + (−2.00)) [← 6]
           = .25 × (−3 − 3 − 2.75 − 3) = −2.9375 (shown as −2.93)

k = 3 grid (terminals in corners):

  0    −2.43 −2.93 −3.00
−2.43 −2.93 −3.00 −2.93
−2.93 −3.00 −2.93 −2.43
−3.00 −2.93 −2.43   0
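The sweeps above can be reproduced with a short sketch of iterative policy evaluation using synchronous full backups over the 4×4 grid (the 0–15 state numbering with terminal corners at 0 and 15 is an implementation choice here):

```python
# Iterative policy evaluation for the 4x4 gridworld: 16 cells, terminal
# corners (states 0 and 15), reward -1 per transition, and the uniform
# random policy (each move with probability 0.25).
MOVES = ((-1, 0), (1, 0), (0, -1), (0, 1))  # up, down, left, right

def step(s, move):
    r, c = divmod(s, 4)
    nr, nc = r + move[0], c + move[1]
    if 0 <= nr < 4 and 0 <= nc < 4:
        return nr * 4 + nc
    return s  # off-grid moves leave the state unchanged

def policy_evaluation(sweeps, gamma=1.0):
    v = [0.0] * 16
    for _ in range(sweeps):
        # synchronous sweep: the whole new table is built from the old one
        v = [0.0 if s in (0, 15) else
             sum(0.25 * (-1 + gamma * v[step(s, m)]) for m in MOVES)
             for s in range(16)]
    return v

print(round(policy_evaluation(1)[1], 2))  # -1.0
print(round(policy_evaluation(3)[1], 2))  # -2.44
```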
Policy Improvement and Control
Alternate two steps until convergence:

Evaluation: π → v_π
Improvement: π → greedy(v_π)

The alternation converges to the optimal value function v* and the optimal policy π*.
Policy Improvement and Control
GridWorld Demo
https://github.com/rlcode/reinforcement-learning
Limitation of Dynamic Programming
• Assumes full knowledge of the MDP
• DP uses full-width backups.
• The number of states can grow rapidly.
• Suitable only for medium-sized problems of up to a few million states.
Monte Carlo Learning
• Model-free learning
• Learns from episodes of experience
• All episodes must have a terminal state
Temporal Difference (TD) Learning
• Learns from episodes of experience.
• Model-free
• TD learns from incomplete episodes.
• Updates an estimate towards another estimate (bootstrapping).
Exploration and Exploitation
• Exploitation is maximizing reward using known information about a system.
• Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, graduating as quickly as possible by taking all the recommended degree courses, getting a job, putting money into retirement schemes, and retiring comfortably in a middle-class house.
• Always following a system based on known information means missing out on the potential for better results.
• Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, taking a course in Neural Networks out of curiosity, changing subject, graduating, starting an AI company, growing the company, becoming a billionaire, never retiring :-)
Q-Learning
Q-Learning
• Q-learning updates the Q-value slightly in the direction of the best possible next Q-value:

Q(s, a) ← Q(s, a) + α(r + γ max_{a'} Q(s', a') − Q(s, a))
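The update rule can be sketched as tabular Q-learning on a made-up 5-state corridor (the environment, constants, and the ε-greedy behavior policy are all illustrative assumptions):

```python
import random
from collections import defaultdict

# Toy corridor: states 0..4; moving right from state 3 reaches the
# terminal state 4 and pays +1, every other transition pays 0.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = (-1, +1)  # step left / step right
Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def env_step(state, action):
    nxt = max(0, min(4, state + action))
    reward = 1.0 if nxt == 4 else 0.0
    return nxt, reward, nxt == 4  # next state, reward, done

random.seed(0)
for _ in range(200):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = env_step(s, a)
        # move Q(s,a) slightly toward r + gamma * max_a' Q(s',a')
        target = r if done else r + GAMMA * max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
```

After training, Q[(3, +1)] approaches 1 (the terminal reward) and earlier states approach the γ-discounted values behind it.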
Q-Learning Properties
• Model-free
• Change of task (reinforcement) requires re-training
• A special kind of Temporal Difference learning
• Convergence assured only for Markov states
• Tabular approach requires every observed state-action
pair to have an entry
Action Selection
• Greedy – always pick the action with the highest value
• Break ties randomly
• ε-greedy – choose a random action with a low probability ε
• Softmax – always choose randomly, weighted by the respective Q-values
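The three selection strategies can be sketched as follows (the Q-values are hypothetical):

```python
import math
import random

# Hypothetical Q-values for a single state, for illustration only.
q = {"left": 1.0, "right": 2.0, "jump": 0.5}

def greedy(q_values):
    # break ties randomly among the max-valued actions
    best = max(q_values.values())
    return random.choice([a for a, v in q_values.items() if v == best])

def epsilon_greedy(q_values, eps=0.1):
    # with low probability eps, ignore the values entirely
    if random.random() < eps:
        return random.choice(list(q_values))
    return greedy(q_values)

def softmax(q_values, temperature=1.0):
    # choose randomly, weighted by exp(Q / temperature)
    weights = [math.exp(v / temperature) for v in q_values.values()]
    return random.choices(list(q_values), weights=weights)[0]
```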
Reinforcement Function
• Implicitly supplies the goal to the agent
• Designing the function is an art
• Mistakes result in the agent learning the wrong behavior
• When the agent needs to learn the behavior with the shortest duration, penalize every action a little for "wasting time".
Q-Learning Demos
Rocket Lander Demo: https://github.com/dbatalov/reinforcement-learning
Grid World Demo: https://github.com/rlcode/reinforcement-learning
Tabular Approach and its Limitation
DQN
Universal Function Approximation Theorem
• Let φ be a nonconstant, bounded, and monotonically increasing continuous function.
• Let I_m denote the m-dimensional unit hypercube [0, 1]^m.
• The space of continuous functions on I_m is denoted by C(I_m).
Then, given ε > 0 and any function f ∈ C(I_m), there exist
• an integer N
• real constants v_i, b_i ∈ ℝ
• real vectors w_i ∈ ℝ^m
where i = 1, 2, …, N, such that we may define

F(x) = Σ_{i=1}^N v_i φ(w_i^T x + b_i)

as an approximate realization of the function f (where f is independent of φ); that is,

|F(x) − f(x)| < ε

for all x in I_m.
Deep Reinforcement Learning
• An Artificial Neural Network is a Universal Function Approximator.
• We can use an ANN as an approximation of the agent, to choose what action to take to maximize reward.
Check this link for proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
David Silver
DQN Network
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75% of the human score in 29 out of 49 games
• The DQN agent beats the human score (>100%) in 22 games

Score% = (Agent Score − Random Play Score) / (Human Score − Random Play Score) × 100
DQN for Breakout
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
DQN Algorithm
• Techniques that increase stability and improve convergence:
• ε-greedy Exploration
• Technique: Choose the action given by the current policy with probability (1−ε) and a random action with probability ε
• Advantage: Minimizes overfitting of the network
• Experience (s_t, a_t, r_t, s_{t+1}) Replay
• Technique: Store the agent's experiences and use random samples from them to update the Q-network
• Advantage: Removes correlations in the observation sequence
• Periodic update of Q towards the target
• Technique: Every C updates, clone the Q-network and use the clone (Q̂) to generate targets for the following C updates to the Q-network
• Advantage: Reduces correlations with the target
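The experience replay idea can be sketched as a small buffer (the capacity and batch size here are illustrative, not the paper's values):

```python
import random
from collections import deque

# Minimal experience replay buffer sketch.
class ReplayBuffer:
    def __init__(self, capacity=1000):
        # a bounded deque: the oldest experiences fall off the end
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation
        # between consecutive observations
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Fill with 5000 dummy transitions; only the most recent 1000 are kept.
buf = ReplayBuffer(capacity=1000)
for t in range(5000):
    buf.push(t, 0, -1.0, t + 1, False)
batch = buf.sample(32)
```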
DQN-Algorithm
The flow on this slide walks through the algorithm with a DQN and a (cloned) target DQN:

Initialize replay memory D (capacity N = 1M) with random play.
Initialize the DQN Q(s, a; θ) with random weights θ, and the cloned target DQN Q̂(s, a; θ⁻) with θ⁻ = θ.

For each episode m: select the initial state and observe s₁.
For each time step t:
  a_t = random action with probability ε, else argmax_a Q(s_t, a; θ)
  Observe reward r_t, move to the next state, and add (s_t, a_t, r_t, s_{t+1}) to D.
  Generate training data: U(D) = random sample of D.
  For each (s_j, a_j, r_j, s_{j+1}) ∈ U(D):
    y_j = r_j, if the episode terminates at step j + 1
    y_j = r_j + γ max_{a'} Q̂(s_{j+1}, a'; θ⁻), otherwise
  Update the DQN weights θ using U(D) with the targets y_j.
  Every 10K steps, clone the DQN: θ⁻ = θ.
Function Approximation
Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]

Iterative update:
Q_i(s, a) = E_{s'}[r + γ max_{a'} Q_{i−1}(s', a') | s, a], with Q_i → Q* as i → ∞

Function approximation:
Q*(s, a) ≈ Q(s, a; θ)

Modified iterative update:
Q(s, a; θ_i) ≈ E_{s'}[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a]

Loss function to minimize:
L_i(θ_i) = E_{s,a,r}[(y − Q(s, a; θ_i))²], where y = r + γ max_{a'} Q(s', a'; θ_{i−1})
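The modified iterative update and loss can be sketched in plain Python for a sampled minibatch (`q_target` and `q_online` stand in for the two networks; all values are illustrative):

```python
# Compute DQN targets y and the squared TD loss for a minibatch of
# (state, action, reward, next_state, done) tuples.
GAMMA = 0.99

def dqn_targets(batch, q_target):
    # q_target(s) returns a list of Q-values for all actions in state s,
    # standing in for the frozen target network.
    ys = []
    for state, action, reward, next_state, done in batch:
        if done:
            ys.append(reward)  # no bootstrap past a terminal state
        else:
            ys.append(reward + GAMMA * max(q_target(next_state)))
    return ys

def td_loss(batch, ys, q_online):
    # mean squared error between targets and the online network's estimates
    errs = [(y - q_online(s)[a]) ** 2
            for (s, a, r, s2, d), y in zip(batch, ys)]
    return sum(errs) / len(errs)
```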
Network Architecture
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
Input: a stack of 4 preprocessed frames, 84 × 84 × 4; the network maps φ(s) to Q(s, a; θ), with one output per action.
Deep Convolutional Network - Nature
DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: linear output, one Q-value per action (Q-values may be negative)
    DQN.add(gluon.nn.Dense(num_action))
Issues with DQN
• Q-Learning overestimates action values, due to the maximization over estimated values.
• Overestimation has been attributed to noise and to insufficiently flexible function approximation.
• DQN provides a flexible function approximator.
• The deterministic nature of Atari games eliminates noise.
• DQN still significantly overestimates action values.
Double Q Learning and DDQN
Double Q-Learning
• The max operator uses the same values for both evaluation and action selection. This leads to over-optimism.
• Decoupling evaluation from action selection can prevent over-optimization. This is the idea behind Double Q-Learning.
• In Double Q-Learning, two value functions are learned by randomly assigning experiences to update either of the two, resulting in two sets of weights, θ and θ′.
• For each update, one set of weights is used to determine the greedy policy and the other to determine its value.

Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)

Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
Untangling Evaluation and Selection
• For action selection we use θ.
• For evaluation we use θ′.

Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
  → Y_t^Q ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t)

Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
  → Y_t^{DoubleQ} ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ′_t)
Over-optimism and Error Estimation – upper bound
• Thrun and Schwartz showed that the upper bound on the error due to over-optimism, when the action values are uniformly distributed in an interval [−ε, ε], is γε (m − 1)/(m + 1), where m is the number of actions.
Over-optimism and Error Estimation – lower bound
• Consider a state s at which Q*(s, a) = V*(s) for all actions a.
• Let Q_t be arbitrary value estimates that are on the whole unbiased, so that Σ_a (Q_t(s, a) − V*(s)) = 0, but that are not all correct, such that (1/m) Σ_a (Q_t(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s.
• Then max_a Q_t(s, a) ≥ V*(s) + √(C/(m − 1)).
• This lower bound is tight. The lower bound on the absolute error of the Double Q-Learning estimate is zero.
Number of Actions and Bias
Bias in Q-Learning vs Double Q-Learning
DDQN
• Use DQN's online network to select the greedy action.
• Use DQN's target network to evaluate its value.

Y_t^{DoubleDQN} ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)
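The difference between the DQN target and the Double DQN target can be sketched on a toy next-state Q-table (all values are made up):

```python
# DQN vs. Double DQN targets for a single transition with reward 0.
GAMMA = 0.99

def dqn_target(reward, q_target_next):
    # selection AND evaluation both use the target network -> max operator
    return reward + GAMMA * max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next):
    # selection uses the online network, evaluation uses the target network
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + GAMMA * q_target_next[a_star]

q_online_next = [1.0, 3.0]   # online net prefers action 1
q_target_next = [2.5, 0.5]   # target net values action 1 much lower

y_dqn = dqn_target(0.0, q_target_next)                          # 0.99 * 2.5
y_ddqn = double_dqn_target(0.0, q_online_next, q_target_next)   # 0.99 * 0.5
```

When the two networks disagree, the Double DQN target no longer chases the single largest (and possibly overestimated) target-network value.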
Results
Bayesian DQN or BDQN
Focusing on Efficient Exploration
• The central claim is that mechanisms such as ε-greedy exploration are inefficient.
• Thompson Sampling allows for targeted exploration in high dimensions but is computationally too expensive.
• BDQN aims to implement Thompson Sampling at scale through function approximation.
• BDQN combines DQN with a BLR (Bayesian Linear Regression) model on the last layer.
Thompson Sampling
• Thompson Sampling maintains a prior distribution over environment models (reward and/or dynamics).
• The distribution is updated as observations are made.
• To choose an action, a sample is drawn from the posterior belief, and the action that maximizes the expected return under the sampled belief is selected.
• For more information, see "A Tutorial on Thompson Sampling", Daniel Russo et al.: https://arxiv.org/abs/1707.02038
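Thompson Sampling is easiest to see on a Bernoulli bandit; the sketch below maintains a Beta posterior per arm (the arm payout probabilities are invented for illustration):

```python
import random

# Thompson Sampling on a 2-armed Bernoulli bandit.
random.seed(0)
TRUE_P = [0.3, 0.7]     # hidden payout probability of each arm
wins = [1, 1]           # Beta(wins, losses) posterior parameters per arm
losses = [1, 1]
pulls = [0, 0]

for _ in range(2000):
    # sample a belief about each arm's payout probability ...
    samples = [random.betavariate(wins[a], losses[a]) for a in (0, 1)]
    # ... and play the arm whose sampled belief looks best
    a = 0 if samples[0] > samples[1] else 1
    pulls[a] += 1
    # update the posterior with the observed outcome
    if random.random() < TRUE_P[a]:
        wins[a] += 1
    else:
        losses[a] += 1
```

As the posteriors tighten, the better arm is sampled as best more and more often, so exploration concentrates where the uncertainty still matters.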
TS vs ε-Greedy
• ε-greedy focuses on the greedy action.
• TS explores actions with higher estimated returns with higher probability.
• A TS-based strategy improves the exploration/exploitation balance by trading off the expected returns against their uncertainties, while the ε-greedy strategy ignores all of this information.
TS vs ε-Greedy
TS vs ε-Greedy
• TS finds the optimal Q-function faster.
• It randomizes over Q-functions with highly promising returns and high uncertainty.
• When the true Q-function is selected, its posterior probability increases.
• When other functions are selected, wrong values are estimated and their posterior probability is driven toward zero.
• An ε-greedy agent randomizes its action with probability ε even after having chosen the true Q-function; therefore it takes exponentially many trials to reach the target.
BDQN Algorithm
Network Architecture
BDQN Performance
Closing Words
Value Alignment
References
• DQN: https://www.nature.com/articles/nature14236
• DDQN: https://arxiv.org/abs/1509.06461
• BDQN: https://arxiv.org/abs/1802.04412
• DQN MXNet Code: https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
• DQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb
• DDQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb
• BDQN MXNet/Gluon Code: https://github.com/kazizzad/BDQN-MxNet-Gluon

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Fine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringFine-tuning BERT for Question Answering
Fine-tuning BERT for Question Answering
 
Introduction to GluonNLP
Introduction to GluonNLPIntroduction to GluonNLP
Introduction to GluonNLP
 
Introduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningIntroduction to object tracking with Deep Learning
Introduction to object tracking with Deep Learning
 
Introduction to GluonCV
Introduction to GluonCVIntroduction to GluonCV
Introduction to GluonCV
 
Introduction to Computer Vision
Introduction to Computer VisionIntroduction to Computer Vision
Introduction to Computer Vision
 
Image Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and Challenges
 
Introduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionIntroduction to Deep face detection and recognition
Introduction to Deep face detection and recognition
 
Generative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetGenerative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNet
 
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiDeep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
 
Using Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetUsing Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNet
 
AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
 
Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
 
Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018
 
ONNX and Edge Deployments
ONNX and Edge DeploymentsONNX and Edge Deployments
ONNX and Edge Deployments
 
Distributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkDistributed Inference with MXNet and Spark
Distributed Inference with MXNet and Spark
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
 
Debugging and Performance tricks for MXNet Gluon
Debugging and Performance tricks for MXNet GluonDebugging and Performance tricks for MXNet Gluon
Debugging and Performance tricks for MXNet Gluon
 

Recently uploaded

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Recently uploaded (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Game Playing RL Agent

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Principal Solutions Architect – AWS Deep Learning Amazon Web Services Game Playing RL Agent
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Inspiration From Nature https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hardware of Learning http://www.biologyreference.com/Mo-Nu/Neuron.html
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hardware of Learning http://www.biologyreference.com/Mo-Nu/Neuron.html [Diagram: a perceptron with inputs I1, I2, bias B, weights w1, w2, w3 and output O.] f(x_i, w_i) = Φ(b + Σ_i (w_i · x_i)), where Φ(x) = 1 if x ≥ 0.5 and 0 if x < 0.5. Truth table for P ∧ Q: T T → T, T F → F, F T → F, F F → F.
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Process of Learning
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Process of Learning [Diagram, after Sutton and Barto: the agent takes action A_t; the environment returns the next state S_{t+1} and reward R_{t+1} to the agent.]
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Markov State An information state (a.k.a. Markov state) contains all useful information from the history. A state S_t is Markov if and only if: P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Expected Return • Expected return G_t: sequence of rewards, potentially discounted by a factor γ, where γ ∈ [0, 1]: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
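The discounted-return sum above can be sketched in a few lines of Python; the reward list and discount factor below are illustrative values, not from the slides.

```python
# Sketch: computing the discounted return G_t from a sequence of rewards,
# G_t = r_{t+1} + gamma*r_{t+2} + ... = sum_k gamma^k * r_{t+k+1}.

def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0]                # three steps of +1 reward
print(discounted_return(rewards, 0.9))   # 1 + 0.9 + 0.81 ≈ 2.71
```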
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bellman Expectation Equations v_π(s) = E_π[G_t | S_t = s] = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]. The value of s is the expected return at state s, following policy π subsequently. This is the Bellman Expectation Equation, which can also be expressed as an action-value function for policy π: q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] = R_s^a + γ Σ_{s′} P_{ss′}^a v_π(s′), the value of taking action a at state s under policy π. [Backup diagram: s → v_π(s); (s, a) → q_π(s, a); s′ → v_π(s′).]
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bellman Equations - Example [Diagram: one-step backups.] First example, three terminal outcomes with rewards 10, 5, −3 and probabilities .5, .25, .25: v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5. Second example, reward 5 with probability .5, otherwise a successor whose outcomes are valued 2, 5 and 4.4 with probabilities .4, .5, .1: v(s) = 5 × .5 + .5 × [.4 × 2 + .5 × 5 + .1 × 4.4] ≈ 4.4.
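The two backups on this slide can be checked directly; a minimal sketch, using the slide's probabilities, rewards and successor values:

```python
# One-step Bellman backup: expected value of s is the sum over outcomes
# of p * (r + v_next).

def backup(outcomes):
    return sum(p * (r + v) for p, r, v in outcomes)

# First example: three terminal outcomes with rewards 10, 5, -3.
v1 = backup([(.5, 10, 0), (.25, 5, 0), (.25, -3, 0)])
print(round(v1, 2))   # 5.5

# Second example: reward 5 w.p. .5, else a successor valued from 2, 5, 4.4.
v_next = backup([(.4, 0, 2), (.5, 0, 5), (.1, 0, 4.4)])
v2 = backup([(.5, 5, 0), (.5, 0, v_next)])
print(round(v2, 2))   # 4.37, which the slide rounds to 4.4
```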
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Optimal Policy [Backup diagram: from s, each action a leads through reward r to a successor s′; take the max over actions.] Example with rewards R = −1, 2, 3 leading to states valued 10, 5, −3: v*(s) = max{−1 + 10, 2 + 5, 3 − 3} = 9. A policy π is better than π′ if v_π(s) ≥ v_π′(s) ∀ s ∈ S. v*(s) ≡ max_π v_π(s) ∀ s ∈ S.
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Iterative Policy Evaluation [Grid: a 4 × 4 gridworld with non-terminal states 1–14 and two terminal corner cells.] R_t = −1. π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25.
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Iterative Policy Evaluation k = 0: the value grid is initialized to 0.00 for all states 1–14; the two corner cells are terminal. R_t = −1. π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25. k: iteration number.
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Iterative Policy Evaluation k = 1: v_{k=1}(1) = .25 × (−1 + 0) [→, state 2] + .25 × (−1 + 0) [↑] + .25 × (−1 + 0) [↓, state 5] + .25 × (−1 + 0) [←, terminal] = −1. Likewise v_{k=1}(7) = .25 × (−1 + 0) [→] + .25 × (−1 + 0) [↑, state 3] + .25 × (−1 + 0) [↓, state 11] + .25 × (−1 + 0) [←, state 6] = −1. After the sweep every non-terminal state has value −1.00; the terminal cells stay at 0. π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25, R_t = −1.
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Iterative Policy Evaluation k = 2: v_{k=2}(1) = .25 × (−1 + (−1.00)) [→, state 2] + .25 × (−1 + (−1.00)) [↑] + .25 × (−1 + (−1.00)) [↓, state 5] + .25 × (−1 + 0) [←, terminal] = .25 × (−2 − 2 − 2 − 1) = −1.75. v_{k=2}(7) = .25 × (−1 − 1.00) × 4 = −2.00. States adjacent to a terminal cell become −1.75; all other non-terminal states become −2.00. π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25, R_t = −1.
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Iterative Policy Evaluation k = 3: v_{k=3}(1) = .25 × (−1 + (−2.00)) [→, state 2] + .25 × (−1 + (−1.75)) [↑] + .25 × (−1 + (−2.00)) [↓, state 5] + .25 × (−1 + 0) [←, terminal] = .25 × (−3 − 2.75 − 3 − 1) = −2.4375 ≈ −2.43. v_{k=3}(7) = .25 × (−1 − 2.00) [→] + .25 × (−1 − 2.00) [↑, state 3] + .25 × (−1 − 1.75) [↓, state 11] + .25 × (−1 − 2.00) [←, state 6] = .25 × (−11.75) = −2.9375 ≈ −2.93. π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25, R_t = −1.
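The sweeps above can be reproduced in code. A minimal sketch of iterative policy evaluation on the slides' 4 × 4 gridworld, assuming (as in the diagrams) that states 1–14 are non-terminal, the two corner cells are terminal, every move earns R = −1, the uniform policy picks each direction with probability .25, and moves off the grid leave the state unchanged:

```python
# Synchronous iterative policy evaluation on the 4x4 gridworld.
def policy_evaluation(sweeps, gamma=1.0):
    v = [0.0] * 16                        # cells 0 and 15 are terminal
    for _ in range(sweeps):
        new_v = v[:]
        for s in range(1, 15):
            r, c = divmod(s, 4)
            total = 0.0
            for dr, dc in [(0, 1), (-1, 0), (1, 0), (0, -1)]:
                nr, nc = r + dr, c + dc
                # bounce off the edge: stay in place
                ns = s if not (0 <= nr < 4 and 0 <= nc < 4) else nr * 4 + nc
                total += .25 * (-1 + gamma * v[ns])
            new_v[s] = total
        v = new_v
    return v

print(policy_evaluation(2)[1])   # -1.75, matching the k = 2 slide
print(policy_evaluation(3)[1])   # -2.4375, which the k = 3 slide shows as -2.43
```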
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Policy Improvement and Control [Diagram: π → v_π (evaluation); v → greedy(v) (improvement); alternating evaluation and improvement converges to π*, v*.]
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Policy Improvement and Control
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. GridWorld Demo https://github.com/rlcode/reinforcement-learning
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Limitation of Dynamic Programming • Assumes full knowledge of the MDP • DP uses full-width backups • The number of states can grow rapidly • Suitable only for medium-sized problems of up to a few million states
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Monte Carlo Learning • Model-free learning • Learning from episodes of experience • All episodes must have a terminal state
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Temporal Difference (TD) Learning • Learning from episodes of experience. • Model-Free • TD learns from incomplete episodes. • Updating an estimate towards an estimate. … TD(1) TD(2)
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Exploration and Exploitation • Exploitation is maximizing reward using known information about a system. • Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, graduating as quickly as possible by taking all the recommended degree courses, getting a job, putting money in retirement schemes, retiring comfortably in a middle-class house. • Always following a system based on known information means missing out on potentially better results. • Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, taking a course in Neural Networks out of curiosity, changing subject, graduating, starting an AI company, growing the company, becoming a billionaire, never retiring :-)
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Q-Learning
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Q-Learning • Q-learning updates the Q value slightly in the direction of the best possible next Q value: Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a))
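The update rule above is one line of code in a tabular setting. A minimal sketch; the toy states, actions and step sizes below are illustrative, not from the slides:

```python
# Tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 1.0                  # pretend we already value s1/right
q_update(Q, "s0", "right", 0.0, "s1", actions)
print(Q[("s0", "right")])                 # 0.1 * (0 + 0.9*1.0 - 0) ≈ 0.09
```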
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Q-Learning Properties • Model-free • Change of task (reinforcement) requires re-training • A special kind of Temporal Difference learning • Convergence assured only for Markov states • Tabular approach requires every observed state-action pair to have an entry
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Action Selection • Greedy – always pick the action with the highest value • Break ties randomly • ε-greedy – choose randomly with low probability ε • Softmax – always choose randomly, weighted by the respective Q-values
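The three selection rules above can be sketched as follows; the Q-values are illustrative:

```python
# Greedy (ties broken randomly), epsilon-greedy, and softmax action selection.
import math
import random

def greedy(q_values):
    best = max(q_values)
    return random.choice([i for i, q in enumerate(q_values) if q == best])

def epsilon_greedy(q_values, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(q_values))
    return greedy(q_values)

def softmax_probs(q_values, temperature=1.0):
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

qs = [1.0, 2.0, 2.0, 0.5]
print(greedy(qs))          # 1 or 2, chosen at random (tie-break)
print(softmax_probs(qs))   # most probability mass on the two 2.0 actions
```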
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reinforcement Function • Implicitly supplies the goal to the agent • Designing the function is an art • Mistakes result in the agent learning the wrong behavior • When the agent must learn the behavior with the shortest duration, penalize every action a little for "wasting time".
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Q-Learning Demos Rocket Lander Demo: https://github.com/dbatalov/reinforcement-learning Grid World Demo: https://github.com/rlcode/reinforcement-learning
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tabular Approach and its Limitation
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DQN
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Universal Function Approximation Theorem • Let φ be a nonconstant, bounded, and monotonically increasing continuous function. • Let I_m denote the m-dimensional unit hypercube [0, 1]^m. • The space of continuous functions on I_m is denoted by C(I_m). Then, given ε > 0 and any function f ∈ C(I_m), there exist • an integer N • real constants v_i, b_i ∈ ℝ • real vectors w_i ∈ ℝ^m, where i = 1, 2, …, N, such that we may define: F(x) = Σ_{i=1}^{N} v_i φ(w_i^T x + b_i) as an approximate realization of the function f, where f is independent of φ; that is, |F(x) − f(x)| < ε for all x in I_m.
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deep Reinforcement Learning • An Artificial Neural Network is a Universal Function Approximator. • We can use an ANN as an approximation of an agent to choose what action to take to maximize reward. Check this link for a proof of the theorem: https://en.wikipedia.org/wiki/Universal_approximation_theorem David Silver
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DQN Network https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf • The DQN agent achieves >75% of the human score in 29 out of 49 games • The DQN agent beats the human score (>100%) in 22 games Score_norm = (Agent score − Random play score) / (Human score − Random play score) × 100
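The normalization formula above is easy to sketch directly; the example scores below are made up for illustration:

```python
# Normalized score: how far the agent's score lies between random play (0)
# and human performance (100), expressed as a percentage.

def normalized_score(agent, random_play, human):
    return (agent - random_play) / (human - random_play) * 100

print(normalized_score(agent=900.0, random_play=100.0, human=1100.0))  # 80.0
```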
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DQN for Breakout https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DQN Algorithm • Techniques that increase stability and improve convergence • ε-greedy Exploration • Technique: Choose the action given by the optimal policy with probability (1 − ε) and a random action with probability ε • Advantage: Minimizes overfitting of the network • Experience (s_t, a_t, r_t, s_{t+1}) Replay • Technique: Store the agent's experiences and use samples from them to update the Q-network • Advantage: Removes correlations in the observation sequence • Periodic update of Q towards the target • Technique: Every C updates, clone the Q-network and use the clone (Q̂) to generate targets for the following C updates to the Q-network • Advantage: Reduces correlations with the target
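The experience-replay technique above can be sketched as a small buffer of (s_t, a_t, r_t, s_{t+1}) tuples sampled uniformly at random; the capacity and batch size here are illustrative, not the slide's N = 1M:

```python
# Experience replay: store transitions, train on random samples to break
# the correlations in the observation sequence.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(10):
    buf.add(t, 0, -1.0, t + 1)
batch = buf.sample(4)
print(len(batch))   # 4 transitions drawn uniformly without replacement
```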
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DQN-Algorithm [Diagram: the DQN training loop.] Initialize replay memory D (N = 1M) with random play; initialize the online DQN Q(s, a; θ) and the cloned target DQN Q̂(s, a; θ⁻) with random weights. For each episode, observe s_1. At each time step t: a_t = random action with probability ε, else argmax_a Q(s_t, a; θ); observe reward r_t, move to s_{t+1}, and add (s_t, a_t, r_t, s_{t+1}) to D. Generate training data U(D) as a random sample of D; for each (s_j, a_j, r_j, s_{j+1}) ∈ U(D): y_j = r_j if the episode terminates at step j + 1, else r_j + γ max_{a′} Q̂(s_{j+1}, a′; θ⁻). Update the online DQN using U(D) with the targets y_j. Every 10K steps, clone the online DQN into the target: θ⁻ = θ.
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Function Approximation • Bellman equation: Q*(s, a) = E_{s′}[r + γ max_{a′} Q*(s′, a′) | s, a] • Iterative update: Q_i(s, a) = E_{s′}[r + γ max_{a′} Q_{i−1}(s′, a′) | s, a], with Q_i → Q* as i → ∞ • Function approximation: Q*(s, a) ≈ Q(s, a; θ) • Modified iterative update: Q(s, a; θ_i) ≈ E_{s′}[r + γ max_{a′} Q(s′, a′; θ_{i−1}) | s, a] • Loss function to minimize: L_i(θ_i) = E_{s,a,r}[(y − Q(s, a; θ_i))²], where y = r + γ max_{a′} Q(s′, a′; θ_{i−1})
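The target y and the squared loss above can be sketched with plain numbers standing in for network outputs; the values below are illustrative:

```python
# TD target y = r + gamma * max_a' Q(s', a'; theta_{i-1}) and the
# squared loss (y - Q(s, a; theta_i))^2 from the slide.

def td_target(r, q_next_values, gamma=0.99, terminal=False):
    if terminal:
        return r          # no bootstrap term at episode end
    return r + gamma * max(q_next_values)

def squared_loss(y, q_sa):
    return (y - q_sa) ** 2

y = td_target(r=1.0, q_next_values=[0.5, 2.0], gamma=0.5)
print(y)                      # 1.0 + 0.5 * 2.0 = 2.0
print(squared_loss(y, 1.5))   # (2.0 - 1.5)^2 = 0.25
```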
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Network Architecture https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf [Diagram: the network takes the preprocessed frames φ(s), an 84 × 84 × 4 input, and outputs Q(s, a; θ) for every action.]
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deep Convolutional Network - Nature
    from mxnet import gluon

    DQN = gluon.nn.Sequential()
    with DQN.name_scope():
        # first layer
        DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
        DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
        DQN.add(gluon.nn.Activation('relu'))
        # second layer
        DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
        DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
        DQN.add(gluon.nn.Activation('relu'))
        # third layer
        DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
        DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
        DQN.add(gluon.nn.Activation('relu'))
        DQN.add(gluon.nn.Flatten())
        # fourth layer
        DQN.add(gluon.nn.Dense(512, activation='relu'))
        # fifth layer: linear output, one Q-value per action
        # (a 'relu' here would clip negative Q-values to zero)
        DQN.add(gluon.nn.Dense(num_action))
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Issues with DQN • Q-Learning overestimates action values due to the maximization over estimated values. • Over-estimation has been attributed to noise and insufficiently flexible function approximation. • DQN provides a flexible function approximator. • The deterministic nature of Atari games eliminates noise. • DQN still significantly overestimates action values.
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Double Q Learning and DDQN
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Double Q-Learning • The max operator uses the same values for both evaluation and action selection. This leads to over-optimism. • Decoupling evaluation and action selection can prevent this over-optimism. This is the idea behind Double Q-Learning. • In Double Q-Learning, two value functions are learned by randomly assigning experiences to update either of the two, resulting in two sets of weights, θ and θ′. • For each update, one set of weights is used to determine the greedy policy and the other to determine its value. Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t); Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
  • 44. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Untangling Evaluation and Selection • For action selection we are using θ. • For evaluation we are using θ′. Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t) → Y_t^Q ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t). Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻) → Y_t^{DoubleQ} ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ′_t)
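The decoupling can be sketched with two plain dicts of Q-values standing in for the two sets of weights; the numbers below are made up so that one estimator visibly overestimates:

```python
# DQN target: select AND evaluate the next action with one estimator.
# Double Q target: select with one estimator (theta), evaluate with the
# other (theta').

def dqn_target(r, q_online_next, gamma=1.0):
    return r + gamma * max(q_online_next.values())

def double_q_target(r, q_select_next, q_eval_next, gamma=1.0):
    a_star = max(q_select_next, key=q_select_next.get)   # argmax under theta
    return r + gamma * q_eval_next[a_star]               # value under theta'

q_theta = {"a": 1.0, "b": 3.0}    # online estimates (overestimates "b")
q_theta2 = {"a": 1.2, "b": 1.5}   # second estimator's view of the same state
print(dqn_target(0.0, q_theta))                  # 3.0: max of one estimator
print(double_q_target(0.0, q_theta, q_theta2))   # 1.5: decoupled, less biased
```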
  • 45. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Over-optimism and Error Estimation – upper bound • Thrun and Schwartz showed that when the action-value errors are uniformly distributed in an interval [−ε, ε], the upper bound of the over-estimation error is γε (m − 1)/(m + 1), where m is the number of actions.
  • 46. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Over-optimism and Error Estimation – lower bound • Consider a state s at which Q*(s, a) = V*(s) for all a. • Let Q_t be arbitrary value estimates that are on the whole unbiased, so that Σ_a (Q_t(s, a) − V*(s)) = 0, but are not all correct, such that (1/m) Σ_a (Q_t(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s. • Then max_a Q_t(s, a) ≥ V*(s) + √(C/(m − 1)). • The lower bound is tight. The lower bound on the absolute error of Double Q-Learning is zero.
  • 47. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Number of Actions and Bias
  • 48. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bias in Q-Learning vs Double Q-Learning
  • 49. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DDQN • Using DQN's target network for value estimation • Using DQN's online network for determining the greedy policy Y_t^{DoubleDQN} ≡ R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)
  • 50. Results
  • 51. Bayesian DQN or BDQN
  • 52. Focusing on Efficient Exploration • The central claim is that mechanisms such as ε-greedy exploration are inefficient. • Thompson Sampling allows for targeted exploration in high dimensions but is computationally too expensive in general. • BDQN aims to implement Thompson Sampling at scale through function approximation. • BDQN combines DQN with a BLR (Bayesian Linear Regression) model on the last layer.
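The last-layer idea can be sketched as follows: treat the last hidden layer's features φ(s) as fixed, keep a Gaussian BLR posterior over the last-layer weights for each action, and act by sampling one weight vector per action from the posteriors. This is an illustrative sketch only; the prior/noise variances and all names below are assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # feature dimension of the last hidden layer
sigma2, prior_var = 1.0, 10.0  # assumed noise and prior variances

def blr_posterior(Phi, y):
    """Gaussian posterior N(mean, cov) over weights, given features Phi (n x d) and targets y."""
    cov = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(d) / prior_var)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def thompson_action(phi_s, posteriors):
    """Sample one weight vector per action, then act greedily on the sampled Q-values."""
    q = [phi_s @ rng.multivariate_normal(mean, cov) for mean, cov in posteriors]
    return int(np.argmax(q))

# Toy usage: two actions whose targets come from opposite weight vectors.
Phi = rng.normal(size=(50, d))
posteriors = [blr_posterior(Phi, Phi @ w) for w in (np.ones(d), -np.ones(d))]
a = thompson_action(np.ones(d), posteriors)
```

The appeal is that sampling from a d-dimensional Gaussian is cheap, so approximate Thompson Sampling costs little more than a forward pass.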
  • 53. Thompson Sampling • Thompson Sampling maintains a prior distribution over the environment models (reward and/or dynamics). • The distribution is updated as observations are made. • To choose an action, a sample is drawn from the posterior belief, and the action that maximizes the expected return under the sampled belief is selected. • For more information, refer to "A Tutorial on Thompson Sampling," Daniel Russo et al.: https://arxiv.org/abs/1707.02038
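The three steps above (maintain a posterior, sample a belief, act greedily on the sample) are easiest to see in the textbook Bernoulli-bandit setting, which is a standard illustration rather than anything specific to these slides. Beta(1, 1) priors are assumed:

```python
import random

random.seed(1)
true_p = [0.2, 0.5, 0.8]   # unknown success probability of each arm
alpha = [1, 1, 1]          # Beta posterior parameters (successes + 1)
beta = [1, 1, 1]           # Beta posterior parameters (failures + 1)
pulls = [0, 0, 0]

for _ in range(2000):
    # 1) sample a success probability from each arm's posterior ...
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    # 2) ... play the arm that looks best under the sampled belief ...
    a = samples.index(max(samples))
    # 3) ... and update that arm's posterior with the observed reward.
    r = 1 if random.random() < true_p[a] else 0
    alpha[a] += r
    beta[a] += 1 - r
    pulls[a] += 1

print(pulls)  # pulls concentrate on the best arm (index 2)
```

Early on the wide posteriors make all arms likely to be sampled as best (exploration); as evidence accumulates, the posteriors tighten and play concentrates on the true best arm (exploitation), with no ε schedule to tune.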
  • 54. TS vs ε-Greedy • ε-greedy focuses on the greedy action. • TS explores actions with higher estimated returns with higher probability. • A TS-based strategy improves the exploration/exploitation balance by trading off expected returns against their uncertainties, while an ε-greedy strategy ignores all of this information.
  • 55. TS vs ε-Greedy
  • 56. TS vs ε-Greedy • TS finds the optimal Q-function faster. • It randomizes over Q-functions with promising returns and high uncertainty. • When the true Q-function is selected, its posterior probability increases. • When other functions are selected, wrong values are estimated and their posterior probability is driven to zero. • An ε-greedy agent randomizes its action with probability ε even after having chosen the true Q-function; therefore it takes exponentially many trials to reach the target.
  • 57. BDQN Algorithm
  • 58. Network Architecture
  • 59. BDQN Performance
  • 60. Closing Words
  • 61. Value Alignment
  • 62. References • DQN: https://www.nature.com/articles/nature14236 • DDQN: https://arxiv.org/abs/1509.06461 • BDQN: https://arxiv.org/abs/1802.04412 • DQN MXNet code: https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn • DQN MXNet/Gluon code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb • DDQN MXNet/Gluon code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb • BDQN MXNet/Gluon code: https://github.com/kazizzad/BDQN-MxNet-Gluon