REINFORCEMENT LEARNING
(USAGES & PROBLEMS)
Zahra Khoobi
Zahra.khoobi71@gmail.com
KNTU University
Fall 2017
Outline
■ Definition & history
■ Usages
■ Open problems
■ References
Definition
■ Learning a behavior strategy (a policy) that maximizes the
long-term sum of rewards (delayed reward) through direct
interaction (trial and error) with an unknown and
uncertain environment (formalized below).
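A standard formalization of this objective (the notation below is the usual one and is not taken from the slide itself): the agent looks for a policy \pi that maximizes the expected discounted return

$$ J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right], \qquad 0 \le \gamma < 1 . $$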
History…
■ Studies of animal learning (1911)
– The law of effect [Thorndike, 1911]
■ Operant conditioning [Skinner, 1938]
– process by which humans and animals learn to
behave in such a way as to obtain rewards and avoid
punishments
■ Bellman formulation [Bellman, 1957]
– this recursive formula gives the utility of following a given
policy as the expected immediate reward plus the discounted
utility of the next state (see the equations below)
■ Q-Learning [Watkins, 1989, Ph.D. thesis]
– it solves the problem by learning a quantity (Q-value) for
every state-action pair (see below)
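For reference, the two formulations named above in their standard textbook forms (not copied from the slides). The Bellman optimality equation for the value of a state,

$$ V^{*}(s) \;=\; \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big], $$

and Watkins' Q-learning update, which learns a value Q(s, a) for every state-action pair:

$$ Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big]. $$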
Usages
■ Fanuc’s robots
■ Recommendation systems
■ Power management
■ Smart Energy Storage Source
■ Deep Q-Network
Fanuc’s robots
■ Clever industrial robot (March 18, 2016)
■ Give the robot a task
■ It will spend the night figuring out how to do it.
■ Come morning, the machine should have
mastered the job as well as if it had been
programmed by an expert.
■ So it does not need to be programmed by hand for each
specific task (it reaches roughly 90% accuracy)
Fanuc’s robots (cont.)
■ Japanese Robotics Giant Gives Its Arms Some Brains
(October 7, 2016)
■ Today’s industrial bots are typically programmed to do a
single job very precisely and accurately
■ Deal with:
– Nvidia, the Silicon Valley chipmaker (graphics processing / GPUs)
■ Deep neural network that controls a robotic arm’s
movement
■ Connecting its robots to the cloud (shared knowledge)
Fanuc’s robots (cont.)
Recommendation Systems
■ ε-greedy policy combined with SARSA (a powerful reinforcement
learning method; a minimal sketch follows this slide)
■ Global & local models
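A minimal sketch of the ε-greedy SARSA combination mentioned above. The tabular Q representation, the state/action encodings, and all hyperparameter values are illustrative assumptions; they are not taken from the cited recommendation-system paper [4].

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed learning rate, discount, exploration rate
Q = defaultdict(float)                      # tabular Q-values: Q[(state, action)]

def epsilon_greedy(state, actions):
    """With probability EPSILON explore a random item, otherwise recommend greedily."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy SARSA: bootstrap from the action that was actually chosen next."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

Being on-policy, SARSA evaluates the ε-greedy behaviour it actually follows, which is why the next action a_next is sampled before the update is applied.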
Power management
■ Good power management controllers
should be able to observe, learn and
adapt to different hardware systems and
different working environments.
■ Previous work:
– Stochastic approaches
– Supervised learning
– An online learning algorithm that dynamically selects the
best DPM policy from a set of candidate policies called
experts (sketched below)
■ Drawback: low performance (it cannot explore the
power-performance trade-off effectively)
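A rough sketch of the expert-based selection scheme described above, as summarized in the accompanying editor's note: each candidate policy carries a weight reflecting the benefit it would have gained if it had controlled the device during the last idle period, and the highest-weighted expert controls the next idle period. The expert names and the benefit measure are assumptions for illustration only.

# Candidate DPM policies ("experts"); the timeout values are purely illustrative.
experts = {"timeout_0ms": 0.0, "timeout_50ms": 0.0, "timeout_300ms": 0.0}

def update_weights(benefit_per_expert):
    """benefit_per_expert: the benefit (e.g. energy saved minus wake-up cost) each
    expert would have gained had it controlled the device in the last idle period."""
    for name, benefit in benefit_per_expert.items():
        experts[name] = benefit

def select_expert():
    """The expert with the highest weight controls the device for the next idle period."""
    return max(experts, key=experts.get)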
Power management (cont.)
■ Uses an enhanced Q-learning algorithm
■ Converges to a better power management policy in a changing
environment
■ Provides 40% and 90% reductions in power and latency,
respectively
Power management (cont.)
■ Enhanced Q-Learning (a rough sketch follows this slide)
– Modified Cost Function with Latency Constraint (reward)
– Learning in the Observation Domain
– Structure in Cost Function to Reduce Search Space
(policy)
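A sketch of one way a latency constraint can be folded into a Q-learning cost, in the spirit of the "modified cost function" above. The penalty weight LAMBDA and the shape of the cost are illustrative assumptions, not the exact formulation of the cited work; Q is assumed to hold expected costs that the agent minimizes.

from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 2.0      # assumed learning rate, discount, latency penalty weight
Q = defaultdict(float)                     # Q[(state, action)] holds expected cost (lower is better)

def cost(power, latency, latency_budget):
    """Composite cost: consumed power plus a penalty whenever latency exceeds its budget."""
    return power + LAMBDA * max(0.0, latency - latency_budget)

def q_update(s, a, power, latency, latency_budget, s_next, actions):
    """Cost-minimizing Q-learning step: move toward the cheapest successor action."""
    c = cost(power, latency, latency_budget)
    best_next = min(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (c + GAMMA * best_next - Q[(s, a)])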
Smart Energy Storage Source
■ Uses the past experience of an intelligent battery agent to
choose the appropriate action (charge or discharge) for the
next hour
■ Previous work:
– Fuzzy logic
– Genetic algorithms for smart management
– Constrained optimization using some heuristics
■ These approaches ignore the nondeterministic nature of the
environment
Smart Energy Storage Source (cont.)
■ Q-learning
■ State:
– where L_t is the consumer load at time step t and P is the
available wind power output
■ Action:
■ Reward function:
– P: power
– L: load of the battery
– R: amount of charge
– (an illustrative sketch of such an hourly agent follows this slide)
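The exact state and reward formulas are not given in the text above, so the sketch below only illustrates the general shape of such an hourly charge/discharge agent. The reward (store surplus wind power, discharge to cover a deficit) and every parameter value are assumptions, not the formulation of the underlying paper.

import random
from collections import defaultdict

ACTIONS = ("charge", "discharge")
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1     # assumed hyperparameters
Q = defaultdict(float)                      # Q[(state, action)]

def reward(action, wind_power, load, charge_level):
    """Illustrative reward only: store surplus wind power, discharge to cover a deficit."""
    surplus = wind_power - load
    if action == "charge":
        return max(0.0, surplus)                   # reward storing excess generation
    return min(charge_level, max(0.0, -surplus))   # reward covering unmet load from the battery

def hourly_step(state, wind_power, load, charge_level, next_state):
    """One ε-greedy Q-learning step per hour."""
    if random.random() < EPSILON:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: Q[(state, x)])
    r = reward(a, wind_power, load, charge_level)
    best_next = max(Q[(next_state, x)] for x in ACTIONS)
    Q[(state, a)] += ALPHA * (r + GAMMA * best_next - Q[(state, a)])
    return a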
Deep Q-Network
■ Goal: use reinforcement learning successfully in situations
approaching real-world complexity
■ Environment: high-dimensional sensory inputs (raw pixels and
the game score)
■ Humans and other animals seem to solve this problem through a
harmonious combination of reinforcement learning and
hierarchical sensory processing systems
■ Tested on the challenging domain of classic Atari 2600 games
■ Bridges the divide between high-dimensional sensory inputs and
actions (the loss it optimizes is given below)
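The core of the deep Q-network (from the Nature paper, reference [6]) is Q-learning applied to a deep network, with a replay memory D of past transitions and a periodically updated target network with parameters θ⁻:

$$ L(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big]. $$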
Open problems
■ MIT Technology Review included RL in its list of 10
Breakthrough Technologies of 2017
■ But RL still has some open difficulties:
– Multi-Task Learning
– Learning to Remember
– Safe and Effective Exploration
Open problems(1)
■ Multi-Task Learning
– Perform many different types of tasks
– Build up a library of general knowledge and learn
general skills that can be used across a variety of tasks
– While DQN can play a large number of Atari games,
there is no learning across tasks
– The core of this challenge is scalability
Open problems(2)
■ Learning to Remember
– For many real-world tasks, an observation only
captures a small part of the full environment state
– So the agent should remember past observations in order to
determine the best action
– Example:
■ consider an intelligent agent in the workplace that helps a
company support team …
Open problems(2) (cont.)
■ Learning to Remember
– Remembering everything in a conversation makes learning a good
policy intractable; only the important things should be kept
– Conversations move from topic to topic, changing the subject
and looping back again
Open problems(3)
■ Safe and Effective Exploration
– In real-world learning:
■ Driving -> one must learn, but very carefully
– With a complex set of actions:
■ Assembling a car -> it is practically impossible for random
exploration to reach the true reward with normal resources
Some ideas for solving
■ Imitation learning
– A human demonstrates what good behavior is
■ Intrinsic motivation
– Add an internal reward (a sketch follows this slide)
■ The challenge is to move between them
■ Hierarchical learning
– Decompose the task into subtasks to decrease complexity
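One common way to "add an internal reward" is a count-based novelty bonus, sketched here; the bonus weight BETA and the square-root scaling are illustrative assumptions rather than a specific method referenced by the slides.

from collections import Counter

BETA = 0.1                    # assumed weight of the intrinsic (novelty) bonus
visit_counts = Counter()

def shaped_reward(state, extrinsic_reward):
    """Total reward = environment reward + a bonus that shrinks as a state is revisited."""
    visit_counts[state] += 1
    return extrinsic_reward + BETA / (visit_counts[state] ** 0.5)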
References
1. https://www.technologyreview.com/s/601045/this-factory-robot-learns-a-new-job-overnight/
2. https://www.technologyreview.com/s/602553/japanese-robotics-giant-gives-its-arms-some-brains/
3. https://towardsdatascience.com/from-classic-ai-techniques-to-deep-learning-753d20cf8578
4. New Recommendation System Using Reinforcement Learning (https://pdfs.semanticscholar.org/f041/ac53fba83674a23e0a4a3454f73b6112fe3c.pdf)
5. http://ieeexplore.ieee.org/document/7827771/
6. https://www.nature.com/articles/nature14236
7. http://www.maluuba.com/blog/2017/3/14/the-next-challenges-for-reinforcement-learning
Questions?
Thanks for your attention!
60 years ago

Editor's Notes

  • #5 Operant conditioning (or instrumental conditioning): the process by which humans and animals learn to behave in such a way as to obtain rewards and avoid punishments [Skinner, 1938].
  • #7 Inside a modest-looking office building in Tokyo lives an unusually clever industrial robot made by the Japanese company Fanuc. Give the robot a task, like picking widgets out of one box and putting them into another container, and it will spend the night figuring out how to do it. Come morning, the machine should have mastered the job as well as if it had been programmed by an expert. Fanuc demonstrates a robot trained through reinforcement learning at the International Robot Exhibition in Tokyo in December. Industrial robots are capable of extreme precision and speed, but they normally need to be programmed very carefully in order to do something like grasp an object. This is difficult and time-consuming, and it means that such robots can usually work only in tightly controlled environments. Fanuc’s robot uses a technique known as deep reinforcement learning to train itself, over time, how to learn a new task. It tries picking up objects while capturing video footage of the process. Each time it succeeds or fails, it remembers how the object looked, knowledge that is used to refine a deep learning model, or a large neural network, that controls its action. Deep learning has proved to be a powerful approach in pattern recognition over the past few years. “After eight hours or so it gets to 90 percent accuracy or above, which is almost the same as if an expert were to program it,” explains Shohei Hido, chief research officer at Preferred Networks, a Tokyo-based company specializing in machine learning. “It works overnight; the next morning it is tuned.” Robotics researchers are testing reinforcement learning as a way to simplify and speed up the programming of robots that do factory work. Earlier this month, Google published details of its own research on using reinforcement learning to teach robots how to grasp objects. The Fanuc robot was programmed by Preferred Networks. Fanuc, the world’s largest maker of industrial robots, invested $7.3 million in Preferred Networks in August last year. The companies demonstrated the learning robot at the International Robot Exhibition in Tokyo last December. One of the big potential benefits of the learning approach, Hido says, is that it can be accelerated if several robots work in parallel and then share what they have learned. So eight robots working together for one hour can perform the same learning as one machine going for eight hours. “Our project is oriented to distributed learning,” Hido says. “You can imagine hundreds of factory robots sharing information.” This form of distributed learning, sometimes called “cloud robotics,” is shaping up to be a big trend both in research and industry (see “10 Breakthrough Technologies 2016: Robots That Teach Each Other”). “Fanuc is well placed to think about this,” says Ken Goldberg, a professor of robotics at the University of California, Berkeley, because it installs so many machines in factories around the world. He adds that cloud robotics will most likely reshape the way robots are used in the coming years. Goldberg and colleagues (including several researchers at Google) are in fact taking this a step further by teaching robots how certain movements may be used to grasp not just specific objects but certain shapes. A paper on this work will appear at the IEEE International Conference on Robotics and Automation in May. However, Goldberg notes, applying machine learning to robotics is challenging because controlling behavior is more complex than, say, recognizing objects in images.
“Deep learning has made enormous progress in pattern recognition,” Goldberg says. “The challenge with robotics is that you’re doing something beyond that. You need to be able to generate the appropriate actions for a huge range of inputs.” Fanuc may not be the only company developing robots that use machine learning. In 2014, the Swiss robot maker ABB invested in another AI startup called Vicarious. The fruits of that investment have yet to appear, however.
  • #11 Many existing works focus on the stochastic nature of the power management problem [1]~[4]. However, their techniques require offline system modeling and policy optimization and hence are not adaptive. Reference [5] proposes a user-based adaptive management technique that considers user annoyance as a performance constraint. However, this approach requires offline training, which is not suitable in a changing environment. Each expert has a weight factor, the value of which indicates the benefit gained if the corresponding expert was chosen during the last idle period. The one with the highest value will control the device for the next idle period. Reference [7] proposes a similar approach using a different learning algorithm. The expert-based machine learning algorithm is able to find an appropriate DPM policy in a short time without any prior workload information. However, it cannot explore the power-performance trade-offs effectively.
  • #13 C(s,a) = power consumption; D(s,a) = delay; O = {o1, o2, …} is the set of observations. The system model consists of a service requestor, a service provider, and a service queue.
  • #14 Wind and other natural, renewable energy sources need proper management and storage: correct storage prevents losses during distribution and keeps energy available for use at times of peak consumption.
  • #15 The reward function should be read aloud here.
  • #16 The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
  • #19 For example, the customer first chooses the product and gives the address, and then, in the middle of the address directions, asks about the product's price.
  • #20 Remembering everything in a conversation, however, makes learning a good policy intractable. As humans speak we move from topic-to-topic, changing the subject and looping back again. Some information is very important whereas other information is more tangential. Hence, the challenge is to learn a compact representation that only stores the most salient information. 
  • #22 Imitation of humans; intrinsic motivation; hierarchical learning.