REINFORCEMENT LEARNING
(USAGES & PROBLEMS)
Zahra Khoobi
Zahra.khoobi71@gmail.com
KNTU University
Fall 2017
Outline
■ Definition & history
■ Usages
■ Open problems
■ References
Definition
■ Learning a behavior strategy (a policy) that maximizes the
long-term sum of rewards (delayed reward) through direct
interaction (trial and error) with an unknown and
uncertain environment (formalized below).
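A standard formalization of this objective (the notation below is the usual one and is not taken from the slide itself): the agent looks for a policy \pi that maximizes the expected discounted return

$$ J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right], \qquad 0 \le \gamma < 1 . $$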
History…
■ Studies of animal learning (1911)
– The law of effect [Thorndike, 1911]
■ Operant conditioning [Skinner, 1938]
– process by which humans and animals learn to
behave in such a way as to obtain rewards and avoid
punishments
■ Bellman formulation [Bellman, 1957]
– this recursive formula gives the utility of following a given
policy as the expected immediate reward plus the discounted
utility of the next state (see the equations below)
■ Q-Learning [Watkins, 1989, Ph.D. thesis]
– it solves the problem by learning a quantity (Q-value) for
every state-action pair (see below)
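For reference, the two formulations named above in their standard textbook forms (not copied from the slides). The Bellman optimality equation for the value of a state,

$$ V^{*}(s) \;=\; \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big], $$

and Watkins' Q-learning update, which learns a value Q(s, a) for every state-action pair:

$$ Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big]. $$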
Usages
■ Fanuc’s robots
■ Recommendation systems
■ Power management
■ Smart Energy Storage Source
■ Deep Q-Network
Fanuc’s robots
■ Clever industrial robot (March 18, 2016)
■ Give the robot a task
■ It will spend the night figuring out how to do it.
■ Come morning, the machine should have
mastered the job as well as if it had been
programmed by an expert.
■ So it does not need to be programmed by hand for each
specific task (it reaches roughly 90% accuracy)
Fanuc’s robots (cont.)
■ Japanese Robotics Giant Gives Its Arms Some Brains
(October 7, 2016)
■ Today’s industrial bots are typically programmed to do a
single job very precisely and accurately
■ Deal with:
– Nvidia, the Silicon Valley chipmaker (graphics processing / GPUs)
■ Deep neural network that controls a robotic arm’s
movement
■ Connecting its robots to the cloud (shared knowledge)
Fanuc’s robots (cont.)
Recommendation Systems
■ ε-greedy policy combined with SARSA (a powerful reinforcement
learning method; a minimal sketch follows this slide)
■ Global & local models
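A minimal sketch of the ε-greedy SARSA combination mentioned above. The tabular Q representation, the state/action encodings, and all hyperparameter values are illustrative assumptions; they are not taken from the cited recommendation-system paper [4].

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed learning rate, discount, exploration rate
Q = defaultdict(float)                      # tabular Q-values: Q[(state, action)]

def epsilon_greedy(state, actions):
    """With probability EPSILON explore a random item, otherwise recommend greedily."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy SARSA: bootstrap from the action that was actually chosen next."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

Being on-policy, SARSA evaluates the ε-greedy behaviour it actually follows, which is why the next action a_next is sampled before the update is applied.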
Power management
■ Good power management controllers
should be able to observe, learn and
adapt to different hardware systems and
different working environments.
■ Previous work:
– Stochastic approaches
– Supervised learning
– An online learning algorithm that dynamically selects the
best DPM policy from a set of candidate policies called
experts (sketched below)
■ Drawback: low performance (it cannot explore the
power-performance trade-off effectively)
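A rough sketch of the expert-based selection scheme described above, as summarized in the accompanying editor's note: each candidate policy carries a weight reflecting the benefit it would have gained if it had controlled the device during the last idle period, and the highest-weighted expert controls the next idle period. The expert names and the benefit measure are assumptions for illustration only.

# Candidate DPM policies ("experts"); the timeout values are purely illustrative.
experts = {"timeout_0ms": 0.0, "timeout_50ms": 0.0, "timeout_300ms": 0.0}

def update_weights(benefit_per_expert):
    """benefit_per_expert: the benefit (e.g. energy saved minus wake-up cost) each
    expert would have gained had it controlled the device in the last idle period."""
    for name, benefit in benefit_per_expert.items():
        experts[name] = benefit

def select_expert():
    """The expert with the highest weight controls the device for the next idle period."""
    return max(experts, key=experts.get)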
Power management (cont.)
■ Uses an enhanced Q-learning algorithm
■ Converges to a better power management policy in a changing
environment
■ Provides 40% and 90% reductions in power and latency,
respectively
Power management (cont.)
■ Enhanced Q-Learning (a rough sketch follows this slide)
– Modified Cost Function with Latency Constraint (reward)
– Learning in the Observation Domain
– Structure in Cost Function to Reduce Search Space
(policy)
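A sketch of one way a latency constraint can be folded into a Q-learning cost, in the spirit of the "modified cost function" above. The penalty weight LAMBDA and the shape of the cost are illustrative assumptions, not the exact formulation of the cited work; Q is assumed to hold expected costs that the agent minimizes.

from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 2.0      # assumed learning rate, discount, latency penalty weight
Q = defaultdict(float)                     # Q[(state, action)] holds expected cost (lower is better)

def cost(power, latency, latency_budget):
    """Composite cost: consumed power plus a penalty whenever latency exceeds its budget."""
    return power + LAMBDA * max(0.0, latency - latency_budget)

def q_update(s, a, power, latency, latency_budget, s_next, actions):
    """Cost-minimizing Q-learning step: move toward the cheapest successor action."""
    c = cost(power, latency, latency_budget)
    best_next = min(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (c + GAMMA * best_next - Q[(s, a)])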
Smart Energy Storage Source
■ Uses the past experience of an intelligent battery agent to
choose the appropriate action (charge or discharge) for the
next hour
■ Previous work:
– Fuzzy logic
– Genetic algorithms for smart management
– Constrained optimization using some heuristics
■ These approaches ignore the nondeterministic nature of the
environment
Smart Energy Storage Source (cont.)
■ Q-learning
■ State:
– where L_t is the consumer load at time step t and P is the
available wind power output
■ Action:
■ Reward function:
– P: power
– L: load of the battery
– R: amount of charge
– (an illustrative sketch of such an hourly agent follows this slide)
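The exact state and reward formulas are not given in the text above, so the sketch below only illustrates the general shape of such an hourly charge/discharge agent. The reward (store surplus wind power, discharge to cover a deficit) and every parameter value are assumptions, not the formulation of the underlying paper.

import random
from collections import defaultdict

ACTIONS = ("charge", "discharge")
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1     # assumed hyperparameters
Q = defaultdict(float)                      # Q[(state, action)]

def reward(action, wind_power, load, charge_level):
    """Illustrative reward only: store surplus wind power, discharge to cover a deficit."""
    surplus = wind_power - load
    if action == "charge":
        return max(0.0, surplus)                   # reward storing excess generation
    return min(charge_level, max(0.0, -surplus))   # reward covering unmet load from the battery

def hourly_step(state, wind_power, load, charge_level, next_state):
    """One ε-greedy Q-learning step per hour."""
    if random.random() < EPSILON:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: Q[(state, x)])
    r = reward(a, wind_power, load, charge_level)
    best_next = max(Q[(next_state, x)] for x in ACTIONS)
    Q[(state, a)] += ALPHA * (r + GAMMA * best_next - Q[(state, a)])
    return a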
Deep Q-Network
■ Goal: use reinforcement learning successfully in situations
approaching real-world complexity
■ Environment: high-dimensional sensory inputs (raw pixels and
the game score)
■ Humans and other animals seem to solve this problem through a
harmonious combination of reinforcement learning and
hierarchical sensory processing systems
■ Tested on the challenging domain of classic Atari 2600 games
■ Bridges the divide between high-dimensional sensory inputs and
actions (the loss it optimizes is given below)
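The core of the deep Q-network (from the Nature paper, reference [6]) is Q-learning applied to a deep network, with a replay memory D of past transitions and a periodically updated target network with parameters θ⁻:

$$ L(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big]. $$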
Open problems
■ MIT Technology Review included RL in its list of 10
Breakthrough Technologies of 2017
■ But RL still has some open difficulties:
– Multi-Task Learning
– Learning to Remember
– Safe and Effective Exploration
Open problems(1)
■ Multi-Task Learning
– Perform many different types of tasks
– Build up a library of general knowledge and learn
general skills that can be used across a variety of tasks
– While DQN can play a large number of Atari games,
there is no learning across tasks
– The core of this challenge is scalability
Open problems(2)
■ Learning to Remember
– For many real-world tasks, an observation only
captures a small part of the full environment state
– So the agent should remember past observations in order to
determine the best action
– Example:
■ consider an intelligent agent in the workplace that helps a
company support team …
Open problems(2) (cont.)
■ Learning to Remember
– Remembering everything in a conversation makes learning a good
policy intractable; only the important things should be kept
– Conversations move from topic to topic, changing the subject
and looping back again
Open problems(3)
■ Safe and Effective Exploration
– In real-world learning:
■ Driving -> one must learn, but very carefully
– With a complex set of actions:
■ Assembling a car -> it is practically impossible for random
exploration to reach the true reward with normal resources
Some ideas for solving
■ Imitation learning
– A human demonstrates what good behavior is
■ Intrinsic motivation
– Add an internal reward (a sketch follows this slide)
■ The challenge is to move between them
■ Hierarchical learning
– Decompose the task into subtasks to decrease complexity
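One common way to "add an internal reward" is a count-based novelty bonus, sketched here; the bonus weight BETA and the square-root scaling are illustrative assumptions rather than a specific method referenced by the slides.

from collections import Counter

BETA = 0.1                    # assumed weight of the intrinsic (novelty) bonus
visit_counts = Counter()

def shaped_reward(state, extrinsic_reward):
    """Total reward = environment reward + a bonus that shrinks as a state is revisited."""
    visit_counts[state] += 1
    return extrinsic_reward + BETA / (visit_counts[state] ** 0.5)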
References
1. https://www.technologyreview.com/s/601045/this-factory-robot-learns-a-new-job-overnight/
2. https://www.technologyreview.com/s/602553/japanese-robotics-giant-gives-its-arms-some-brains/
3. https://towardsdatascience.com/from-classic-ai-techniques-to-deep-learning-753d20cf8578
4. New Recommendation System Using Reinforcement Learning (https://pdfs.semanticscholar.org/f041/ac53fba83674a23e0a4a3454f73b6112fe3c.pdf)
5. http://ieeexplore.ieee.org/document/7827771/
6. https://www.nature.com/articles/nature14236
7. http://www.maluuba.com/blog/2017/3/14/the-next-challenges-for-reinforcement-learning
Questions?
Thanks for your attention!
60 years ago

Editor's Notes

  • #5 Operant conditioning (or instrumental conditioning): the process by which humans and animals learn to behave in such a way as to obtain rewards and avoid punishments [Skinner, 1938].
  • #7 Inside a modest-looking office building in Tokyo lives an unusually clever industrial robot made by the Japanese company Fanuc. Give the robot a task, like picking widgets out of one box and putting them into another container, and it will spend the night figuring out how to do it. Come morning, the machine should have mastered the job as well as if it had been programmed by an expert. Fanuc demonstrates a robot trained through reinforcement learning at the International Robot Exhibition in Tokyo in December. Industrial robots are capable of extreme precision and speed, but they normally need to be programmed very carefully in order to do something like grasp an object. This is difficult and time-consuming, and it means that such robots can usually work only in tightly controlled environments. Fanuc’s robot uses a technique known as deep reinforcement learning to train itself, over time, how to learn a new task. It tries picking up objects while capturing video footage of the process. Each time it succeeds or fails, it remembers how the object looked, knowledge that is used to refine a deep learning model, or a large neural network, that controls its action. Deep learning has proved to be a powerful approach in pattern recognition over the past few years. “After eight hours or so it gets to 90 percent accuracy or above, which is almost the same as if an expert were to program it,” explains Shohei Hido, chief research officer at Preferred Networks, a Tokyo-based company specializing in machine learning. “It works overnight; the next morning it is tuned.” Robotics researchers are testing reinforcement learning as a way to simplify and speed up the programming of robots that do factory work. Earlier this month, Google published details of its own research on using reinforcement learning to teach robots how to grasp objects. The Fanuc robot was programmed by Preferred Networks. Fanuc, the world’s largest maker of industrial robots, invested $7.3 million in Preferred Networks in August last year. The companies demonstrated the learning robot at the International Robot Exhibition in Tokyo last December. One of the big potential benefits of the learning approach, Hido says, is that it can be accelerated if several robots work in parallel and then share what they have learned. So eight robots working together for one hour can perform the same learning as one machine going for eight hours. “Our project is oriented to distributed learning,” Hido says. “You can imagine hundreds of factory robots sharing information.” This form of distributed learning, sometimes called “cloud robotics,” is shaping up to be a big trend both in research and industry (see “10 Breakthrough Technologies 2016: Robots That Teach Each Other”). “Fanuc is well placed to think about this,” says Ken Goldberg, a professor of robotics at the University of California, Berkeley, because it installs so many machines in factories around the world. He adds that cloud robotics will most likely reshape the way robots are used in the coming years. Goldberg and colleagues (including several researchers at Google) are in fact taking this a step further by teaching robots how certain movements may be used to grasp not just specific objects but certain shapes. A paper on this work will appear at the IEEE International Conference on Robotics and Automation in May. However, Goldberg notes, applying machine learning to robotics is challenging because controlling behavior is more complex than, say, recognizing objects in images.
“Deep learning has made enormous progress in pattern recognition,” Goldberg says. “The challenge with robotics is that you’re doing something beyond that. You need to be able to generate the appropriate actions for a huge range of inputs.” Fanuc may not be the only company developing robots that use machine learning. In 2014, the Swiss robot maker ABB invested in another AI startup called Vicarious. The fruits of that investment have yet to appear, however.
  • #11 Many existing works focus on the stochastic nature of the power management problem [1]~[4]. However, their techniques require offline system modeling and policy optimization and hence are not adaptive. Reference [5] proposes a user-based adaptive management technique that considers user annoyance as a performance constraint. However, this approach requires offline training, which is not suitable in a changing environment. Each expert has a weight factor, the value of which indicates the benefit gained if the corresponding expert was chosen during the last idle period. The one with the highest value will control the device for the next idle period. Reference [7] proposes a similar approach using a different learning algorithm. The expert-based machine learning algorithm is able to find an appropriate DPM policy in a short time without any prior workload information. However, it cannot explore the power-performance trade-offs effectively.
  • #13 C(s,a) = power consumption; D(s,a) = delay; O = {o1, o2, …} is the set of observations. The system model consists of a service requestor, a service provider, and a service queue.
  • #14 Wind and other natural, renewable energy sources need proper management and storage: correct storage prevents losses during distribution and keeps energy available for use at times of peak consumption.
  • #15 The reward function should be read aloud here.
  • #16 The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
  • #19 For example, the customer first chooses the product and gives the address, and then, in the middle of the address directions, asks about the product's price.
  • #20 Remembering everything in a conversation, however, makes learning a good policy intractable. As humans speak we move from topic-to-topic, changing the subject and looping back again. Some information is very important whereas other information is more tangential. Hence, the challenge is to learn a compact representation that only stores the most salient information. 
  • #22 Imitation of humans; intrinsic motivation; hierarchical learning.