Reinforcement Learning
Learning from experience like a human …
Nokia Bell Labs / Norbert Kraft
Introduction
Nokia Bell Labs Research Activities:
• Access
• End to End Network & Service Automation
• Application Platforms & Software Systems
• Standardization
• Smart Network Fabric
• Algorithms, Analytics & Augmented Intelligence
• Emerging Materials, Components and Devices
Taxonomy of Machine Learning
From Analysis to Full Autonomous Control

• Analyze – Descriptive Analytics: What happened?
• Predict – Predictive Analytics: What will happen?
• Control – Prescriptive Analytics: Make it happen!

Difficulty and value both increase along the path from sensing and monitoring through analysis and prediction to control.

11/12/18 Reinforcement Learning - learning from experience like a human
Don’t analyze why you failed – just do it right …
Machine Learning
Basic Ideas …

• Unsupervised learning – find anomalies, similarities, groups: find groups with similar attributes, which are not necessarily self-explaining …
• Unsupervised learning – reduce the complexity of high-dimensional features: generate a limited set of new features with virtual meaning …
• Supervised learning – learn from labelled observations: train on human knowledge by taking a labelled list (… not always existing)
• Reinforcement learning – learn from experience: learn system behavior with experiments (learning by doing)
Machine Learning
Basic Ideas …

• Unsupervised learning: In → Out – find groups with similar attributes, which are not necessarily self-explaining …; generate a limited set of new features with virtual meaning …
• Supervised learning: In → Out, trained against a Target via an Error signal – train on human knowledge by taking a labelled list (… not always existing)
• Reinforcement learning: In → Out, trained via a Reward signal – learning from experience with reward by trial & error
Machine Learning Concepts & Deep Learning

Deep learning cuts across all three categories:
• Unsupervised learning – find anomalies, similarities, groups; reduce the complexity of high-dimensional features
• Supervised learning – learn from labelled observations
• Reinforcement learning – learn from experience
How does this map to a human brain?

• Basal Ganglia ↔ Reinforcement Learning (reward ≈ dopamine)
• Cerebral Cortex ↔ Unsupervised Learning
• Cerebellum ↔ Supervised Learning
Machine Learning
One Way of Human Thinking/Learning

Observations → Actions, with a Reward signal in between: find the action that optimizes the reward …
The Human Way of Thinking/Learning
Learning is trial and error … again and again
Differences between the Human Brain and Neural Networks

Characteristic    | Human Brain        | Neural Network
Feed forward      | Yes                | Yes
Feed backward     | Yes                | Yes (only RNNs …)
Complexity        | 10^11 neurons      | 10^9 transistors
Switching speed   | 10^-3 s            | 10^-9 s
Structure         | Hierarchical       | Flat & simplistic
Operation         | Massively parallel | Still serial & parallel

How about:
• Intuition
• Instinct
• Gut feeling
• Mind
• Intellect
Most people have an idea of dangerous animals without ever learning it …
Reinforcement Learning
Application Areas

• Robotics – control physical systems
• Games
• Optimization – router/radio channel assignment, power optimization, scheduling algorithms, admission control, anomaly detection
• General compute problems
Reinforcement Learning
Universal Self-Learning with Autonomous Algorithms
Reinforcement Learning
Components & Interaction

• No pre-defined knowledge
• Starts with random actions
• Trial & error learning
• Finds the solution with optimum reward
• Agent/environment states are hidden
• The controller (agent) receives observations and reward, and triggers actions
• The system (environment) receives actions, goes into the next state, and generates observations and reward
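The controller/system loop above can be sketched as a minimal interaction loop; `Environment` and `RandomAgent` below are illustrative toy names, not part of the slides:

```python
import random

class Environment:
    """Toy system: the internal state is hidden; it emits observations and reward."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action           # system goes into the next state
        observation = self.state       # what the agent is allowed to see
        reward = -abs(self.state)      # reward: stay close to zero
        return observation, reward

class RandomAgent:
    """No pre-defined knowledge: starts with random actions (trial & error)."""
    def act(self, observation):
        return random.choice([-1, 0, 1])

env = Environment()
agent = RandomAgent()
observation, reward = env.step(0)
for _ in range(10):
    action = agent.act(observation)           # controller triggers an action
    observation, reward = env.step(action)    # system returns observation & reward
```

A learning agent would replace `RandomAgent` and use the reward signal to improve its action choices over time.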
How does this translate to real environments?
Examples: Robots, Games, Telecom

• Agent: human, neural net, decision tree, coded algorithm
• Environment: robot, machine, chess/Go game, telecommunication network, problem
• Actions: go left/right, stop, move pawn to …, set parameter value to …
• Observations: car position/speed, (chess) piece positions/values, temperature, performance KPIs
• Reward: power consumption, number/value of (chess) pieces, game score, call success rate
Reinforcement Learning
Examples: Telecom

• Controller (agent): observes KPIs and changes parameters
• System (environment): the network, which reports high-level KPIs, CEI, and power
Some theory …
Markov Properties

Fully observable process:
• The current state completely characterizes the process
• The future is independent of the past given the present
• The state captures all relevant information from the history
• Once the state is known, the history may be thrown away
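Formally, "the future is independent of the past given the present" is the Markov property:

```latex
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, S_2, \dots, S_t\right]
```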
Reinforcement Learning
Formalisms

Observability
• Full: S^a_t = S^e_t
• Partial: S^a_t ≠ S^e_t

Agent functions
• Policy (predicted action based on state): a = π(s)
• Value (prediction of future reward): v_π(s, a) = E_π[R_{t+1} + R_{t+2} + …]
• Model: build a transition model of the system

The controller (agent) – holding Value, Policy, and Model – exchanges O_t, R_t, and A_t with the system (environment).
Agent Types
Agent functions are optional:

• Value Based: value function, no explicit policy (implicit)
• Policy Based: policy function, no value function
• Actor Critic: policy function and value function
• Model Free: policy and/or value function, no model
• Model Based: policy and/or value function, plus a model
Reinforcement Learning
Exploration vs. Exploitation

Exploration – find more information about the environment …
• Try a random action
• Use an action not used before in this state
• …

Exploitation – exploit already known information to maximize reward …
• Use the action promising the most direct reward
• Use the action promising the most future reward
• …

Exploration-Exploitation Dilemma: pure exploration never converges, pure exploitation gets stuck in a sub-optimum; the optimal solution lies in between.
Reinforcement Learning
Exploration/Exploitation Strategies: (dynamic) ε-Greedy

Use a certain amount of random actions:
• 1 − ran(0,1) > ε → use Q*(s, a)
• 1 − ran(0,1) ≤ ε → use ran(a)

Decrease ε over time:
• ε_{t+1} = λ · ε_t
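A minimal sketch of the (dynamic) ε-greedy rule, assuming a tabular Q stored as a dict from action to value (the function names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action), else exploit (greedy)."""
    if random.random() < epsilon:              # explore: ran(0,1) <= epsilon
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)     # exploit: argmax_a Q(s, a)

def decay(epsilon, lam=0.99):
    """Decrease epsilon over time: eps_{t+1} = lam * eps_t."""
    return lam * epsilon

q = {"left": 0.2, "right": 0.8}
epsilon_greedy(q, epsilon=0.0)   # epsilon = 0 always exploits, returning "right"
```

Calling `decay` once per episode shifts the agent smoothly from exploration toward exploitation.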
Reinforcement Learning & Neural Nets
Limited vs. Unlimited Spaces for Observations, Actions

Limited observation space (x, y, z ∈ ℕ):
• O_t1 = f(x_t1, y_t1, z_t1, …)
• A_t1 = Map(x_t1, y_t1, z_t1, …)
• Approaches: algorithms / tables; policy / mapping; supervised learning (decision trees, lin/log regression)

Unlimited observation space (x, y, z ∈ ℝ):
• O_t1 = f(x_t1, y_t1, z_t1, …)
• A_t1 = model.predict(x_t1, y_t1, z_t1, …)
• Approaches: supervised learning (neural networks, deep learning)
Reinforcement Learning
Discounted Future Reward

• Discounted reward: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
• Discount factor γ ∈ [0, 1]
  - γ ≈ 1: ‘far-sighted’ evaluation
  - γ ≈ 0: ‘myopic’ evaluation
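The discounted return can be computed by folding the reward sequence from the back (a small sketch; `discounted_return` is an illustrative name):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backwards
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```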
Reinforcement Learning
Temporal-Difference Learning: Q-Learning

The result of the Q-function represents the actual & future reward:
• Based on the current state s and the action a applied
• Corrected by the maximal achievable reward in state s_{t+1}

Q(s_t, a_t) = max R_{t+k}

Learning is done by continuously updating Q:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max Q(s_{t+1}, a_{t+1}))

• α: learning rate (adoption rate for learned knowledge)
• γ: discount factor for future reward
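The tabular update rule above, sketched in Python (the `defaultdict` makes unseen state-action pairs default to 0; names are illustrative):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

Q = defaultdict(float)
actions = ["left", "right"]
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
# Starting from Q = 0 everywhere: Q[(0, "right")] = 0.1 * 1.0 = 0.1
```

Repeating this update over many transitions propagates reward information backwards through the state space.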
Reinforcement Learning
Deep Q Networks

Neural networks for large observation & action spaces:
• Can work with pixel-based observations
• Can handle a large number of setup values (actions)

Different variants:
1. One feed-forward pass per (s, a) combination: (State, Action) → Neural Net → Q(s, a)
2. One feed-forward pass per state: State → Neural Net → Q value(a_1), …, Q value(a_n)
Reinforcement Learning
Deep Q Networks: Update Rules

Given a transition ⟨s, a, r, s′⟩:
1. Do a feed-forward pass for all actions in state s
2. Get the max Q value over all actions in state s′
3. Set the target value Q(s, a) = r + γ max Q(s′, a′)
4. Update the weights using back-propagation
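The four update steps can be sketched as follows; `q_net` stands for any function returning a vector of Q values, one per action (a toy lookup table here, not a real network):

```python
def dqn_target(q_net, s, a, r, s_next, gamma=0.99):
    """Build the regression target vector for one transition <s, a, r, s'>."""
    q_values = list(q_net(s))            # 1. feed forward for all actions in s
    best_next = max(q_net(s_next))       # 2. max Q value over all actions in s'
    q_values[a] = r + gamma * best_next  # 3. target Q(s,a) = r + gamma * max Q(s',a')
    return q_values                      # 4. the net is then fit to this via backprop

table = {0: [0.0, 1.0], 1: [2.0, 0.5]}
target = dqn_target(table.get, s=0, a=1, r=1.0, s_next=1)
# target[1] = 1.0 + 0.99 * 2.0 = 2.98; target[0] stays at 0.0
```

Only the entry for the taken action changes, so the back-propagation step only pushes error through that output.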
Reinforcement Learning
Q-Learning Prerequisites / Limitations

• Small changes in Q value can cause a totally different action selection
• No convergence guarantee
• Tries to find a deterministic value function; some problems require a stochastic one
Some examples …
OpenAI
An Agent Development Environment

Ready-to-use environments for agent & algorithm development:
• Computing problems
• Games
• Robots
• 2D problems

Find the optimal model/policy/value function for a problem:
• Model based (unlimited action, observation, reward space)
• Value/policy based (limited action, observation, reward space)

Functions:
• O_t0 = reset()
• O_t1, R_t1, S^e_t = step(A_t0)
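The `reset()`/`step()` contract above can be illustrated with a tiny self-contained environment in the same style (this `CountUpEnv` is a toy stand-in, not an actual OpenAI environment):

```python
class CountUpEnv:
    """Gym-style toy: reach state 3; action 1 moves up, action 0 does nothing."""
    def reset(self):
        self.state = 0
        return self.state                    # O_t0 = reset()

    def step(self, action):
        self.state += 1 if action == 1 else 0
        reward = 1.0 if action == 1 else -1.0
        done = self.state >= 3
        return self.state, reward, done      # observation, reward, termination

env = CountUpEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(1)          # a trivial "always up" policy
    total += reward
# the loop ends after three steps with obs == 3 and total == 3.0
```

Because every environment exposes the same loop, the same agent code can be pointed at CartPole, Mountain Car, or Lunar Lander unchanged.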
OpenAI
Environments for Advanced Algorithm Development

Acrobot, CartPole, Car over Mountain, Pendulum, Humanoid Stand Up, Tennis, Car Race, Lunar Lander, Humanoid Robot, and computational algorithm environments.
Reinforcement Learning
Example: CartPole – Balance an Inverted Pendulum

Simplified to one dimension.

State:
• Cart position [-2.4, 2.4]
• Cart velocity [-inf, inf]
• Pole angle [-41°, 41°]
• Pole velocity at tip [-inf, inf]

Actions (impact cart direction & velocity):
• Push cart to the left
• Push cart to the right

Termination:
• Cart position at a boundary (fails)
• Angle outside [-12°, 12°] (fails)
• More than 200 steps (terminates successfully)

Reward:
• +1 for every non-terminating step

By using random actions, the pole returns to a stable state.
Reinforcement Learning
Example: CartPole – solved with a model-based algorithm (RandomForest)

Reinforcement Learning
Example: CartPole – solved with a model-based algorithm (Neural Network)
Reinforcement Learning
Example: Copy (Algorithm environment)

Copy characters from an observation tape to an output tape:
• Various character sets
• String length increases over different runs

State:
• The character observed at the read head

Actions:
• Move the read head left or right
• Copy the character to the output tape or not

Termination:
• Wrong character written (fails)
• Timeout after some number of unsuccessful trials (fails)
• All characters written to the output tape (terminates successfully)

Reward:
• +1 for a correct character written
• −0.5 for a wrong character written
• 0 for plain head movements
Reinforcement Learning
Example: Copy (Algorithm environment) – solved with discrete Q-learning

Successfully learned to copy strings of random length and content.
Reinforcement Learning
Example: Mountain Car – an underpowered car must cross a hill

You have to go backward first to get enough swing.

State:
• Position on the x axis
• Speed

Actions:
• Push forward
• Push backward
• Do nothing

Termination:
• Timeout after 200 steps
• Car reaches the flag on the hill

Reward:
• −1 for every step
• +0.5 for a right push while speed > 0
• +0.5 for a left push while speed < 0
Reinforcement Learning
Mountain Car videos: random walk vs. training phase
Reinforcement Learning
Example: Lunar Lander – land a spaceship on the moon

• Land in the landing zone
• Surface and start conditions change

State:
• 8 real values (position, angle, speed, …)

Actions:
• Fire the main engine
• Fire the left/right engine

Termination:
• Reaching the landing pad with zero speed
• Can also land outside the landing pad

Reward:
• Firing the main engine: −0.3 (unlimited fuel)
• Ground contact: +10
• Landing in the pad: 100–140
Reinforcement Learning
Example: Lunar Lander videos: random walk vs. training phase
Some final words …
Reinforcement Learning
Problems & Research Areas

• Delayed rewards: the direct reward gives insufficient feedback on the success of a strategy
• Continuous/large observation states: e.g. states built from pictures & complex sensors; requires deep learning
• Exploration/exploitation strategies: slow down solution convergence; find a (sub-)optimal solution
• Meta solution strategies: only a mix of policy, value, model, and Q-function solves most problems; standard supervised algorithms do not solve the problem
Reinforcement Learning
Takeaways …
Thank you
Questions & Answers
norbert.kraft@nokia-bell-labs.com
