16. Expected future rewards
Any goal can be represented as the expected sum of discounted intermediate rewards.
E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid S_t\right] = E\left[R_0 + \gamma R_1 + \gamma^{2} R_2 + \dots \mid S_t\right]
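A minimal sketch of this quantity in Python (the rewards and discount factor below are made-up values, not from the slides):

gamma = 0.99
rewards = [1.0, 0.0, 0.5, 2.0]  # hypothetical rewards R_0, R_1, R_2, R_3

# Discounted return: R_0 + gamma*R_1 + gamma^2*R_2 + ...
expected_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(expected_return)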
17. Tools
1. Policy: π(a|s)
2. Value function: Q(s, a)
3. Model: (P, R)
We have to pick at least one of the three.
18. Policy
A policy defines how the agent behaves.
It takes a state as input and outputs an action.
It can be stochastic or deterministic.
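As an illustrative sketch (the tabular Q and softmax sampling are assumptions, not from the slides), a deterministic policy always returns the same action for a state, while a stochastic policy samples one:

import numpy as np

def deterministic_policy(state, Q):
    # always pick the highest-valued action for this state
    return int(np.argmax(Q[state]))

def stochastic_policy(state, Q, temperature=1.0):
    # sample an action with probability given by a softmax over the Q-values
    prefs = np.exp(Q[state] / temperature)
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(probs), p=probs))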
19. Value function
A value function estimates how much reward the agent can achieve.
It takes a state as input and outputs values, one for each possible action.
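For a small discrete problem the value function can simply be a table; the sizes below are made up for illustration:

import numpy as np

n_states, n_actions = 10, 4          # hypothetical sizes
Q = np.zeros((n_states, n_actions))  # one value per (state, action) pair

state = 3
print(Q[state])     # the value of every possible action in this state
print(Q[state, 2])  # the value of action 2 in state 3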
20. Model
A model is the agent's representation of the environment.
It takes a state and an action as input and outputs (next_state, reward).
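A tabular sketch of a model, assuming a tiny made-up environment; a learned model would replace the dictionary with a function approximator:

# Hypothetical model: (state, action) -> (next_state, reward)
model = {
    (0, 0): (1, 0.0),
    (1, 1): (2, 1.0),
}

def predict(state, action):
    return model[(state, action)]

next_state, reward = predict(0, 0)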
24. Repeat
1. Prediction: Compute the value of the expected reward from s_t until the terminal state.
2. Control: Act greedily with respect to the predicted values.
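One iteration of this loop might look as follows; the tabular Q and the gym-like env.step interface are assumptions for the sketch:

import numpy as np

def act_greedily(Q, state, env):
    # prediction: Q[state] holds the estimated return from s_t for each action
    action = int(np.argmax(Q[state]))            # control: act greedily w.r.t. the predicted values
    next_state, reward, done = env.step(action)  # hypothetical gym-like step
    return next_state, reward, done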
28. Update rule
In rabbits, humans and machines we get the same algorithm:
while True:
    Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])
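A runnable version of the same update with made-up values, showing the estimate moving toward the target as an exponential moving average:

Q_target = 1.0   # assumed fixed target for illustration
alpha = 0.1
Q = 0.0

for _ in range(50):
    Q = Q + alpha * (Q_target - Q)
print(Q)  # approaches 1.0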
29. Q-Learning [Watkins, 1989]
The agent does not have a model of the environment.
It performs actions following a behavior policy.
It predicts using the target policy.
This makes it an "off-policy", model-free method.
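A sketch of one tabular Q-learning step, assuming an epsilon-greedy behavior policy and a gym-like environment (hyperparameter values are illustrative):

import numpy as np

def q_learning_step(Q, state, env, alpha=0.1, gamma=0.99, epsilon=0.1):
    # behavior policy: epsilon-greedy over the current estimates
    if np.random.rand() < epsilon:
        action = np.random.randint(Q.shape[1])
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = env.step(action)
    # target policy: greedy, i.e. max over next actions, regardless of what will actually be taken
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done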
30. Loss function
Building on what we learned from the rabbit.
The learning goal is to minimize the following loss function.
Putting it all together, we get:
Q_target = r + gamma * np.max(Q(s_next, A))
Loss = 1/n * np.sum((Q_target - Q(s, a))**2)
35. DeepMind ideas
1. Different neural networks for Q and Q_target
2. Estimate Q_target using past experiences
3. Update Q_target every C steps
4. Clip rewards between -1 and 1
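A rough sketch of how these ideas can appear in code; the buffer size, sync period C, and weight representation are assumptions:

import numpy as np
from collections import deque

replay_buffer = deque(maxlen=100_000)       # idea 2: store past experiences to estimate Q_target

def clip_reward(r):
    return float(np.clip(r, -1.0, 1.0))     # idea 4: clip rewards to [-1, 1]

def maybe_sync(step, online_weights, target_weights, C=1000):
    # ideas 1 & 3: keep a separate target network, copying the online weights into it every C steps
    if step % C == 0:
        return [w.copy() for w in online_weights]
    return target_weights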
36. Network
Input: an image of shape [None, 42, 42, 4]
4 Conv2D layers, 32 filters, 4x4 kernel
1 Hidden layer of size 256
1 Fully connected layer of size action_size
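A possible Keras sketch of this architecture, reading the slide as four Conv2D layers of 32 filters each; strides, activations, and the action_size value are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

action_size = 4  # assumption: depends on the environment

model = keras.Sequential([
    layers.Conv2D(32, 4, activation="relu", input_shape=(42, 42, 4)),  # first of 4 conv layers
    layers.Conv2D(32, 4, activation="relu"),
    layers.Conv2D(32, 4, activation="relu"),
    layers.Conv2D(32, 4, activation="relu"),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # hidden layer of size 256
    layers.Dense(action_size),             # one Q-value per action
])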