Reinforcement Learning
Learning from experience like a human …
Nokia Bell Labs / Norbert Kraft
Introduction
Nokia Bell Labs Research Activities:
• Access
• End to End Network & Service Automation
• Application Platforms & Software Systems
• Standardization
• Smart Network Fabric
• Algorithms, Analytics & Augmented Intelligence
• Emerging Materials, Components and Devices
Taxonomy of Machine Learning
From Analysis to Full Autonomous Control

• Analyze – Descriptive Analytics: What happened?
• Predict – Predictive Analytics: What will happen?
• Control – Prescriptive Analytics: Make it happen!

Difficulty and value both increase along the path from sensing and monitoring through analysis and prediction to control.

11/12/18 Reinforcement Learning - learning from experience like a human
Don’t analyze why you failed – just do it right …
Machine Learning
Basic Ideas …

• Unsupervised learning – find anomalies, similarities, groups: find groups with similar attributes, which are not necessarily self-explaining …
• Unsupervised learning – reduce the complexity of high-dimensional features: generate a limited set of new features with virtual meaning …
• Supervised learning – learn from labelled observations: train on human knowledge by taking a labelled list (… not always existing)
• Reinforcement learning – learn from experience: learn system behavior with experiments (learning by doing)
Machine Learning
Basic Ideas …

• Unsupervised learning: In → Out – find groups with similar attributes, which are not necessarily self-explaining …; generate a limited set of new features with virtual meaning …
• Supervised learning: In → Out, trained against a Target via an Error signal – train on human knowledge by taking a labelled list (… not always existing)
• Reinforcement learning: In → Out, trained via a Reward signal – learning from experience with reward by trial & error
Machine Learning Concepts & Deep Learning

Deep learning cuts across all three categories:
• Unsupervised learning – find anomalies, similarities, groups; reduce the complexity of high-dimensional features
• Supervised learning – learn from labelled observations
• Reinforcement learning – learn from experience
How does this map to a human brain?

• Basal Ganglia ↔ Reinforcement Learning (reward ≈ dopamine)
• Cerebral Cortex ↔ Unsupervised Learning
• Cerebellum ↔ Supervised Learning
Machine Learning
One Way of Human Thinking/Learning

Observations → Actions, with a Reward signal in between: find the action that optimizes the reward …
The Human Way of Thinking/Learning
Learning is trial and error … again and again
Differences between the Human Brain and Neural Networks

Characteristic    | Human Brain        | Neural Network
Feed forward      | Yes                | Yes
Feed backward     | Yes                | Yes (only RNNs …)
Complexity        | 10^11 neurons      | 10^9 transistors
Switching speed   | 10^-3 s            | 10^-9 s
Structure         | Hierarchical       | Flat & simplistic
Operation         | Massively parallel | Still serial & parallel

How about:
• Intuition
• Instinct
• Gut feeling
• Mind
• Intellect
Most people have an idea of dangerous animals without ever learning it …
Reinforcement Learning
Application Areas

• Robotics – control physical systems
• Games
• Optimization – router/radio channel assignment, power optimization, scheduling algorithms, admission control, anomaly detection
• General compute problems
Reinforcement Learning
Universal Self-Learning with Autonomous Algorithms
Reinforcement Learning
Components & Interaction

• No pre-defined knowledge
• Starts with random actions
• Trial & error learning
• Finds the solution with optimum reward
• Agent/environment states are hidden
• The controller (agent) receives observations and reward, and triggers actions
• The system (environment) receives actions, goes into the next state, and generates observations and reward
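The controller/system loop above can be sketched as a minimal interaction loop; `Environment` and `RandomAgent` below are illustrative toy names, not part of the slides:

```python
import random

class Environment:
    """Toy system: the internal state is hidden; it emits observations and reward."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action           # system goes into the next state
        observation = self.state       # what the agent is allowed to see
        reward = -abs(self.state)      # reward: stay close to zero
        return observation, reward

class RandomAgent:
    """No pre-defined knowledge: starts with random actions (trial & error)."""
    def act(self, observation):
        return random.choice([-1, 0, 1])

env = Environment()
agent = RandomAgent()
observation, reward = env.step(0)
for _ in range(10):
    action = agent.act(observation)           # controller triggers an action
    observation, reward = env.step(action)    # system returns observation & reward
```

A learning agent would replace `RandomAgent` and use the reward signal to improve its action choices over time.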
How does this translate to real environments?
Examples: Robots, Games, Telecom

• Agent: human, neural net, decision tree, coded algorithm
• Environment: robot, machine, chess/Go game, telecommunication network, problem
• Actions: go left/right, stop, move pawn to …, set parameter value to …
• Observations: car position/speed, (chess) piece positions/values, temperature, performance KPIs
• Reward: power consumption, number/value of (chess) pieces, game score, call success rate
Reinforcement Learning
Examples: Telecom

• Controller (agent): observes KPIs and changes parameters
• System (environment): the network, which reports high-level KPIs, CEI, and power
Some theory …
Markov Properties

Fully observable process:
• The current state completely characterizes the process
• The future is independent of the past given the present
• The state captures all relevant information from the history
• Once the state is known, the history may be thrown away
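Formally, "the future is independent of the past given the present" is the Markov property:

```latex
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, S_2, \dots, S_t\right]
```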
Reinforcement Learning
Formalisms

Observability
• Full: S^a_t = S^e_t
• Partial: S^a_t ≠ S^e_t

Agent functions
• Policy (predicted action based on state): a = π(s)
• Value (prediction of future reward): v_π(s, a) = E_π[R_{t+1} + R_{t+2} + …]
• Model: build a transition model of the system

The controller (agent) – holding Value, Policy, and Model – exchanges O_t, R_t, and A_t with the system (environment).
Agent Types
Agent functions are optional:

• Value Based: value function, no explicit policy (implicit)
• Policy Based: policy function, no value function
• Actor Critic: policy function and value function
• Model Free: policy and/or value function, no model
• Model Based: policy and/or value function, plus a model
Reinforcement Learning
Exploration vs. Exploitation

Exploration – find more information about the environment …
• Try a random action
• Use an action not used before in this state
• …

Exploitation – exploit already known information to maximize reward …
• Use the action promising the most direct reward
• Use the action promising the most future reward
• …

Exploration-Exploitation Dilemma: pure exploration never converges, pure exploitation gets stuck in a sub-optimum; the optimal solution lies in between.
Reinforcement Learning
Exploration/Exploitation Strategies: (dynamic) ε-Greedy

Use a certain amount of random actions:
• 1 − ran(0,1) > ε → use Q*(s, a)
• 1 − ran(0,1) ≤ ε → use ran(a)

Decrease ε over time:
• ε_{t+1} = λ · ε_t
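A minimal sketch of the (dynamic) ε-greedy rule, assuming a tabular Q stored as a dict from action to value (the function names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action), else exploit (greedy)."""
    if random.random() < epsilon:              # explore: ran(0,1) <= epsilon
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)     # exploit: argmax_a Q(s, a)

def decay(epsilon, lam=0.99):
    """Decrease epsilon over time: eps_{t+1} = lam * eps_t."""
    return lam * epsilon

q = {"left": 0.2, "right": 0.8}
epsilon_greedy(q, epsilon=0.0)   # epsilon = 0 always exploits, returning "right"
```

Calling `decay` once per episode shifts the agent smoothly from exploration toward exploitation.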
Reinforcement Learning & Neural Nets
Limited vs. Unlimited Spaces for Observations, Actions

Limited observation space (x, y, z ∈ ℕ):
• O_t1 = f(x_t1, y_t1, z_t1, …)
• A_t1 = Map(x_t1, y_t1, z_t1, …)
• Approaches: algorithms / tables; policy / mapping; supervised learning (decision trees, lin/log regression)

Unlimited observation space (x, y, z ∈ ℝ):
• O_t1 = f(x_t1, y_t1, z_t1, …)
• A_t1 = model.predict(x_t1, y_t1, z_t1, …)
• Approaches: supervised learning (neural networks, deep learning)
Reinforcement Learning
Discounted Future Reward

• Discounted reward: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
• Discount factor γ ∈ [0, 1]
  - γ ≈ 1: ‘far-sighted’ evaluation
  - γ ≈ 0: ‘myopic’ evaluation
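The discounted return can be computed by folding the reward sequence from the back (a small sketch; `discounted_return` is an illustrative name):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backwards
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```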
Reinforcement Learning
Temporal-Difference Learning: Q-Learning

The result of the Q-function represents the actual & future reward:
• Based on the current state s and the action a applied
• Corrected by the maximal achievable reward in state s_{t+1}

Q(s_t, a_t) = max R_{t+k}

Learning is done by continuously updating Q:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max Q(s_{t+1}, a_{t+1}))

• α: learning rate (adoption rate for learned knowledge)
• γ: discount factor for future reward
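The tabular update rule above, sketched in Python (the `defaultdict` makes unseen state-action pairs default to 0; names are illustrative):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

Q = defaultdict(float)
actions = ["left", "right"]
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
# Starting from Q = 0 everywhere: Q[(0, "right")] = 0.1 * 1.0 = 0.1
```

Repeating this update over many transitions propagates reward information backwards through the state space.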
Reinforcement Learning
Deep Q Networks

Neural networks for large observation & action spaces:
• Can work with pixel-based observations
• Can handle a large number of setup values (actions)

Different variants:
1. One feed-forward pass per (s, a) combination: (State, Action) → Neural Net → Q(s, a)
2. One feed-forward pass per state: State → Neural Net → Q value(a_1), …, Q value(a_n)
Reinforcement Learning
Deep Q Networks: Update Rules

Given a transition ⟨s, a, r, s′⟩:
1. Do a feed-forward pass for all actions in state s
2. Get the max Q value over all actions in state s′
3. Set the target value Q(s, a) = r + γ max Q(s′, a′)
4. Update the weights using back-propagation
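The four update steps can be sketched as follows; `q_net` stands for any function returning a vector of Q values, one per action (a toy lookup table here, not a real network):

```python
def dqn_target(q_net, s, a, r, s_next, gamma=0.99):
    """Build the regression target vector for one transition <s, a, r, s'>."""
    q_values = list(q_net(s))            # 1. feed forward for all actions in s
    best_next = max(q_net(s_next))       # 2. max Q value over all actions in s'
    q_values[a] = r + gamma * best_next  # 3. target Q(s,a) = r + gamma * max Q(s',a')
    return q_values                      # 4. the net is then fit to this via backprop

table = {0: [0.0, 1.0], 1: [2.0, 0.5]}
target = dqn_target(table.get, s=0, a=1, r=1.0, s_next=1)
# target[1] = 1.0 + 0.99 * 2.0 = 2.98; target[0] stays at 0.0
```

Only the entry for the taken action changes, so the back-propagation step only pushes error through that output.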
Reinforcement Learning
Q-Learning Prerequisites / Limitations

• Small changes in Q value can cause a totally different action selection
• No convergence guarantee
• Tries to find a deterministic value function; some problems require a stochastic one
Some examples …
OpenAI
An Agent Development Environment

Ready-to-use environments for agent & algorithm development:
• Computing problems
• Games
• Robots
• 2D problems

Find the optimal model/policy/value function for a problem:
• Model based (unlimited action, observation, reward space)
• Value/policy based (limited action, observation, reward space)

Functions:
• O_t0 = reset()
• O_t1, R_t1, S^e_t = step(A_t0)
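The `reset()`/`step()` contract above can be illustrated with a tiny self-contained environment in the same style (this `CountUpEnv` is a toy stand-in, not an actual OpenAI environment):

```python
class CountUpEnv:
    """Gym-style toy: reach state 3; action 1 moves up, action 0 does nothing."""
    def reset(self):
        self.state = 0
        return self.state                    # O_t0 = reset()

    def step(self, action):
        self.state += 1 if action == 1 else 0
        reward = 1.0 if action == 1 else -1.0
        done = self.state >= 3
        return self.state, reward, done      # observation, reward, termination

env = CountUpEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(1)          # a trivial "always up" policy
    total += reward
# the loop ends after three steps with obs == 3 and total == 3.0
```

Because every environment exposes the same loop, the same agent code can be pointed at CartPole, Mountain Car, or Lunar Lander unchanged.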
OpenAI
Environments for Advanced Algorithm Development

Acrobot, CartPole, Car over Mountain, Pendulum, Humanoid Stand Up, Tennis, Car Race, Lunar Lander, Humanoid Robot, and computational algorithm environments.
Reinforcement Learning
Example: CartPole – Balance an Inverted Pendulum

Simplified to one dimension.

State:
• Cart position [-2.4, 2.4]
• Cart velocity [-inf, inf]
• Pole angle [-41°, 41°]
• Pole velocity at tip [-inf, inf]

Actions (impact cart direction & velocity):
• Push cart to the left
• Push cart to the right

Termination:
• Cart position at a boundary (fails)
• Angle outside [-12°, 12°] (fails)
• More than 200 steps (terminates successfully)

Reward:
• +1 for every non-terminating step

By using random actions, the pole returns to a stable state.
Reinforcement Learning
Example: CartPole – solved with a model-based algorithm (RandomForest)

Reinforcement Learning
Example: CartPole – solved with a model-based algorithm (Neural Network)
Reinforcement Learning
Example: Copy (Algorithm environment)

Copy characters from an observation tape to an output tape:
• Various character sets
• String length increases over different runs

State:
• The character observed at the read head

Actions:
• Move the read head left or right
• Copy the character to the output tape or not

Termination:
• Wrong character written (fails)
• Timeout after some number of unsuccessful trials (fails)
• All characters written to the output tape (terminates successfully)

Reward:
• +1 for a correct character written
• −0.5 for a wrong character written
• 0 for plain head movements
Reinforcement Learning
Example: Copy (Algorithm environment) – solved with discrete Q-learning

Successfully learned to copy strings of random length and content.
Reinforcement Learning
Example: Mountain Car – an underpowered car must cross a hill

You have to go backward first to get enough swing.

State:
• Position on the x axis
• Speed

Actions:
• Push forward
• Push backward
• Do nothing

Termination:
• Timeout after 200 steps
• Car reaches the flag on the hill

Reward:
• −1 for every step
• +0.5 for a right push while speed > 0
• +0.5 for a left push while speed < 0
Reinforcement Learning
Mountain Car videos: random walk vs. training phase
Reinforcement Learning
Example: Lunar Lander – land a spaceship on the moon

• Land in the landing zone
• Surface and start conditions change

State:
• 8 real values (position, angle, speed, …)

Actions:
• Fire the main engine
• Fire the left/right engine

Termination:
• Reaching the landing pad with zero speed
• Can also land outside the landing pad

Reward:
• Firing the main engine: −0.3 (unlimited fuel)
• Ground contact: +10
• Landing in the pad: 100–140
Reinforcement Learning
Example: Lunar Lander videos: random walk vs. training phase
Some final words …
Reinforcement Learning
Problems & Research Areas

• Delayed rewards: the direct reward gives insufficient feedback on the success of a strategy
• Continuous/large observation states: e.g. states built from pictures & complex sensors; requires deep learning
• Exploration/exploitation strategies: slow down solution convergence; find a (sub-)optimal solution
• Meta solution strategies: only a mix of policy, value, model, and Q-function solves most problems; standard supervised algorithms do not solve the problem
Reinforcement Learning
Takeaways …
Thank you
Questions & Answers
norbert.kraft@nokia-bell-labs.com
