Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
Deep Reinforcement Learning talk at PI School, covering the following topics:
1- Deep Reinforcement Learning
2- Q-Learning
3- Deep Q-Learning (DQN)
4- Google DeepMind paper (DQN for Atari)
In some applications, the output of the system is a sequence of actions. In such cases a single action is not important on its own; consider game playing, where a single move by itself is not that important. As the agent acts on its environment, it receives some evaluation of its action (reinforcement), but it is not told which action is the correct one for achieving its goal.
Reinforcement Learning (RL) approaches deal with finding an optimal reward-based policy for acting in an environment (talk in English).
However, what has led to their widespread use is their combination with deep neural networks (DNNs), i.e., deep reinforcement learning (Deep RL). Recent successes, not only in learning to play games but also in surpassing human performance at them, along with academia-industry research collaborations on manipulation of objects, locomotion skills, smart grids, and more, have demonstrated their effectiveness on a wide variety of challenging tasks.
With applications spanning games, robotics, dialogue, healthcare, marketing, energy, and many more domains, Deep RL might just be the power that drives the next generation of Artificial Intelligence (AI) agents!
This presentation contains an introduction to reinforcement learning, a comparison with other learning paradigms, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
Deep Reinforcement Learning: Q-Learning, by Kai-Wen Zhao
This slide deck reviews deep reinforcement learning, especially Q-Learning and its variants. We introduce the Bellman operator and approximate it with a deep neural network. Last but not least, we review the classic paper in which DeepMind's Atari game agent beats human performance. Some tips for stabilizing DQN are also included.
Reinforcement Learning 4. Dynamic Programming, by Seung Jae Lee
A summary of Chapter 4: Dynamic Programming of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
(Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS)
This CloudxLab Reinforcement Learning tutorial helps you understand Reinforcement Learning in detail. The topics covered in this tutorial are:
1) What is Reinforcement?
2) Reinforcement Learning: An Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Lecture slides from DASI, spring 2018, National Cheng Kung University, Taiwan. The content is about deep reinforcement learning: policy gradients, including variance reduction and importance sampling.
Lecture slides from DSAI 2018, National Cheng Kung University, about the well-known deep reinforcement learning algorithm Actor-Critic. In these slides, we introduce the advantage function and A3C/A2C.
Slides from my presentation of Richard Sutton and Andrew Barto's "Introduction to Reinforcement Learning Chapter 1"
Video (https://www.youtube.com/watch?v=4SLGEq_HZxk&t=2s)
Hill Climbing Algorithm in Artificial Intelligence, by Bharat Bhushan
The hill climbing algorithm is a local search algorithm that continuously moves in the direction of increasing elevation/value to find the peak of the mountain, i.e., the best solution to the problem. It terminates when it reaches a peak where no neighbor has a higher value.
Hill climbing is a technique used for optimizing mathematical problems. One of the widely discussed examples is the Traveling Salesman Problem, in which we need to minimize the distance traveled by the salesman.
It is also called greedy local search, as it only looks to its best immediate neighbor state and not beyond.
A node of the hill climbing algorithm has two components: state and value.
Hill climbing is mostly used when a good heuristic is available.
In this algorithm, we don't need to maintain a search tree or graph, as it keeps only a single current state.
Features of Hill Climbing:
The main features of the hill climbing algorithm are:
Generate and Test variant: Hill climbing is a variant of the Generate and Test method, which produces feedback that helps decide which direction to move in the search space.
Greedy approach: The search moves in the direction that optimizes the cost.
No backtracking: It does not backtrack through the search space, as it does not remember previous states.
State-space Diagram for Hill Climbing:
The state-space landscape is a graphical representation of the hill climbing algorithm, showing a graph between the various states of the algorithm and the objective function/cost.
The y-axis shows the function, which can be an objective function or a cost function, and the x-axis shows the state space. If the y-axis shows cost, the goal of the search is to find the global and local minima; if it shows an objective function, the goal is to find the global and local maxima.
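To make the procedure concrete, here is a minimal sketch of greedy hill climbing in Python; the `neighbors` and `objective` functions are hypothetical placeholders supplied by the caller, not part of any specific library.

```python
import random

def hill_climb(start, neighbors, objective, max_steps=10_000):
    """Greedy local search: move to the best neighbor until none improves.

    `neighbors(state)` and `objective(state)` are assumed to be supplied
    by the caller; this sketch only illustrates the control flow.
    """
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=objective, default=None)
        if best is None or objective(best) <= objective(current):
            return current  # peak reached: no neighbor has a higher value
        current = best
    return current

# Toy usage: maximize -(x - 3)^2 over integer states.
result = hill_climb(
    start=random.randint(-10, 10),
    neighbors=lambda x: [x - 1, x + 1],
    objective=lambda x: -(x - 3) ** 2,
)
print(result)  # 3
```

Note that, with no backtracking and only one current state, this loop can stall on a local maximum; that limitation is exactly what the state-space landscape above illustrates.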
We propose a distributed deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning (RL). We adapt the DistBelief software framework to efficiently train the deep RL agents using the Apache Spark cluster computing framework.
1118 Seminar: Continuous Deep Q-Learning with Model-based Acceleration, by Hye-min Ahn
The material that I used to present the paper
"Continuous Deep Q-Learning with Model-based Acceleration", S. Gu, T. Lillicrap, I. Sutskever, S. Levine, ICML 2016
Details of Lazy Deep Learning for Image Recognition in the ZZ Photo app, by PAY2 YOU
The talk presents deep learning for image recognition. It covers practical aspects of training deep convolutional networks on GPUs, discusses first-hand experience porting trained neural networks into an application based on the OpenCV library, and compares the resulting pet detector, built with the Lazy Deep Learning approach, against the Viola-Jones detector.
Speakers: Artem Chernodub, an expert in artificial neural networks and artificial intelligence systems. He graduated from the Moscow Institute of Physics and Technology in 2007. He leads the Computer Vision direction at ZZ Wolf and also works as a research scientist at the Institute of Mathematical Machines and Systems Problems of the NAS of Ukraine.
Yuriy Pashchenko, a specialist in machine vision and machine learning systems, holds a master's degree from NTUU "Kyiv Polytechnic Institute", Faculty of Applied Mathematics (2014). He works at ZZ Wolf as an R&D Engineer.
What one needs to know to work in the Natural Language Processing field, and the aspects of developing an NLP project, illustrated with a system for identifying the language of a text.
This chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows:
- Sequential decision problems in uncertain environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
- The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MDP is a policy that associates a decision with every state that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed.
- The utility of a state is the expected utility of the state sequences encountered when an optimal policy is executed, starting in that state. The value iteration algorithm for solving MDPs works by iteratively solving the equations relating the utility of each state to those of its neighbors.
- Policy iteration alternates between calculating the utilities of states under the current policy and improving the current policy with respect to the current utilities.
- Partially observable MDPs, or POMDPs, are much more difficult to solve than MDPs. They can be solved by conversion to an MDP in the continuous space of belief states. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and therefore make better decisions in the future.
- A decision-theoretic agent can be constructed for POMDP environments. The agent uses a dynamic decision network to represent the transition and observation models, to update its belief state, and to project forward possible action sequences.
- Game theory describes rational behavior for agents in situations where multiple agents interact simultaneously. Solutions of games are Nash equilibria: strategy profiles in which no agent has an incentive to deviate from the specified strategy.
- Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.
We shall return to the world of MDPs and POMDPs in Chapter 21, when we study reinforcement learning methods that allow an agent to improve its behavior from experience in sequential, uncertain environments.
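The summary names value iteration and policy iteration only in prose; as a concrete illustration, here is a minimal sketch of policy iteration for a finite MDP. The `P` and `R` structures are hypothetical stand-ins for a concrete transition model and reward function.

```python
import random

def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Policy iteration for a finite MDP.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a]
    is the immediate reward; both are assumed given by the model.
    """
    policy = {s: random.choice(actions) for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iteratively compute V under the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the new utilities.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] +
                       gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```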
Intro to Reinforcement Learning - part II, by Mikko Mäkipää
Introduction to Reinforcement Learning, part II: Basic tabular methods
This is the second presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce some more building blocks, such as policy iteration, bandits and exploration, epsilon-greedy policies, and temporal difference methods.
We introduce basic model-free methods that use tabular value representation: Monte Carlo on- and off-policy, Sarsa, Expected Sarsa, and Q-learning.
The algorithms are illustrated using simplified blackjack as an environment.
This talk was an introduction to Reinforcement Learning based on the book by Richard S. Sutton and Andrew Barto. We explained the main components of an RL problem and detailed the tabular and approximate solution methods.
Reinforcement Learning has historically not been as widely adopted in production as other learning approaches (particularly supervised learning), despite being capable of addressing a broader set of problems.
But we are now seeing growth in production RL applications: so much so that in specific areas it is approaching a tipping point into the mainstream. In this talk, we'll discuss why this is happening, detail concrete examples of areas where RL is adding value, and share some practical tips on deploying RL in your organization.
Andrii Prysiazhnyk: Why the Amazon sellers are buying the RTX 3080: Dynamic pricing with RL (Lviv Startup Club)
AI & BigData Online Day 2021
Website - http://aiconf.com.ua
YouTube - https://www.youtube.com/startuplviv
FB - https://www.facebook.com/aiconf
Making smart decisions in real-time with Reinforcement Learning, by Ruth Yakubu
The process of reinforcement learning (RL) involves trial and error, rewarding actions, and remembering past experiences over time. This technique is used when building sequential decision-making solutions like automated self-driving cars, video games, or personalized content recommendations. However, one of the challenges in building reinforcement learning models is that it takes a long time for the system to learn and to reach high accuracy. In this session, we'll explore different reinforcement learning solutions: how to implement relevant user experiences that improve over time based on behavior using a pre-built API, and how to build your custom model from scratch in Python while increasing learning speed and final performance using Azure Machine Learning and Ray/RLlib.
Speaker: 곽동현 (PhD student at Seoul National University, currently at NAVER Clova)
An overview of reinforcement learning (RL) and an introduction to recent deep learning-based RL trends.
Presentation video:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
TensorFlow and Deep Learning Tips and Tricks, by Ben Ball
Presented at https://www.meetup.com/TensorFlow-and-Deep-Learning-Singapore/events/241183195/ . Tips and tricks for using TensorFlow with deep reinforcement learning.
See our blog for more information at http://prediction-machines.com/blog/
2. What is Reinforcement Learning?
- Much like how biological agents behave
- No supervisor, only a reward
- Data is time dependent (non-i.i.d.)
- Feedback is delayed
- Agent actions affect the data it receives
3. Examples
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- Robocup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
Videos
4. Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward
- Defeat the world champion at Go: +R / -R for winning/losing a game
- Make a robot walk: +R for moving forward, -R for falling over
- Play ATARI games: +R / -R for increasing/decreasing the score
- Control a helicopter: +R / -R for following the trajectory / crashing
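As a tiny worked example of "expected cumulative reward", here is the discounted return of a hypothetical reward sequence (the numbers are made up for illustration):

```python
rewards = [1.0, 0.0, -1.0, 2.0]  # hypothetical per-step rewards
gamma = 0.9                      # discount factor

ret = sum(gamma**t * r for t, r in enumerate(rewards))
print(ret)  # approximately 1.648 = 1.0 + 0 - 0.81 + 1.458
```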
6. Fully Observable Environments
Fully Observable Environments (agent state = environment state):
- Agent directly observes environment
- Example: chess board
Partially Observable Environments (agent state ≠ environment state):
- Agent indirectly observes environment
- Example: a robot with a motion sensor or camera
- Agent must construct its own state representation
7. RL components: Policy and Value Function
A policy is the agent's behaviour function:
- Maps from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
A value function is a prediction of future reward:
- Used to evaluate states and select between actions
- v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
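A minimal sketch of the two policy types in Python, using a hypothetical two-state, two-action environment; the states, actions, and probabilities are invented for illustration:

```python
import random

# Deterministic policy: a fixed mapping from state to action.
det_policy = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return det_policy[state]

# Stochastic policy: a probability distribution over actions per state.
stoch_policy = {"s0": {"left": 0.8, "right": 0.2},
                "s1": {"left": 0.1, "right": 0.9}}

def act_stochastic(state):
    actions, probs = zip(*stoch_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"), act_stochastic("s0"))
```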
12. Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs.
An MDP is a tuple (S, A, P, R, γ), where:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix: P^a_ss' = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R^a_s = E[R_{t+1} | S_t = s, A_t = a]
- γ ∈ [0, 1] is a discount factor
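One way such a tuple might be written down in plain Python, with a hypothetical two-state MDP whose numbers are invented for illustration:

```python
# A tiny hypothetical MDP: P[s][a] -> [(prob, next_state)], R[s][a] -> reward.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["stay", "go"],
    "P": {
        "s0": {"stay": [(1.0, "s0")], "go": [(0.9, "s1"), (0.1, "s0")]},
        "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
    },
    "R": {
        "s0": {"stay": 0.0, "go": 1.0},
        "s1": {"stay": 2.0, "go": 0.0},
    },
    "gamma": 0.9,
}
```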
13. State-Value and Action-Value functions, Bellman eq.
State-value function, the expected return starting from state s and then following policy π: v_π(s) = E_π[G_t | S_t = s]
Action-value function, the expected return starting from state s, taking action a, and then following policy π: q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
14. Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function v_*(s)
- All optimal policies achieve the optimal action-value function q_*(s, a)
All you need is to find q_*(s, a); acting greedily with respect to it yields an optimal policy.
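A sketch of that last step, extracting the greedy policy from an action-value table; the dict-of-dicts layout Q[s][a] is a hypothetical representation chosen for illustration:

```python
def greedy_policy(Q):
    """Extract the greedy policy (optimal, if Q == q*) from an action-value table."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

Q = {"s0": {"stay": 0.0, "go": 1.9}, "s1": {"stay": 2.2, "go": 0.9}}
print(greedy_policy(Q))  # {'s0': 'go', 's1': 'stay'}
```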
15. Bellman Optimality Equation for the state-value function:
v_*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_ss' v_*(s') )
[David Silver. Advanced Topics: RL]
16. Bellman Optimality Equation for the action-value function:
q_*(s, a) = R^a_s + γ Σ_{s'} P^a_ss' max_{a'} q_*(s', a')
[David Silver. Advanced Topics: RL]
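These equations suggest the value iteration algorithm mentioned earlier: repeatedly apply the Bellman optimality backup until the values stop changing. A minimal sketch over the hypothetical dict-based MDP from the earlier snippet:

```python
def value_iteration(mdp, tol=1e-8):
    """Iterate the Bellman optimality backup v(s) <- max_a [R + gamma * E v(s')]."""
    V = {s: 0.0 for s in mdp["states"]}
    while True:
        delta = 0.0
        for s in mdp["states"]:
            v = max(mdp["R"][s][a] +
                    mdp["gamma"] * sum(p * V[s2] for p, s2 in mdp["P"][s][a])
                    for a in mdp["actions"])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

print(value_iteration(mdp))  # converged optimal state values
```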
20. Q-Learning: a model-free, off-policy control algorithm
Model-free (vs model-based):
- The MDP model is unknown, but experience can be sampled from the MDP
- Or the model is known but too big to use, except through samples
Off-policy (vs on-policy):
- Can learn about a policy from experience sampled from some other policy
Control (vs prediction):
- Finds the best policy
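A minimal sketch of the tabular Q-learning update with an epsilon-greedy behaviour policy; the `env.reset`/`env.step` interface is a hypothetical stand-in (loosely in the style of OpenAI Gym, mentioned earlier in this document), and the hyperparameter values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: explore with probability epsilon, else greedy.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from the greedy next action,
            # regardless of what the behaviour policy will actually do.
            target = reward + (0.0 if done else gamma * max(Q[next_state].values()))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```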
24. Issues with Q-learning with a neural network
- Data is sequential (non-i.i.d.)
- Policy changes rapidly with slight changes to Q-values
- Policy may oscillate
- Experience flows from one extreme to another
- Scale of rewards and Q-values is unknown
- Unstable backpropagation due to large gradients
25. DQN solutions
- Use experience replay:
  - Breaks correlations in the data
  - Learns from all past policies
  - Uses off-policy Q-learning
- Freeze the target Q-network:
  - Avoids policy oscillations
  - Breaks correlations between the Q-network and the target
- Clip rewards and gradients
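A condensed sketch of how these fixes fit together: a replay buffer that breaks correlations by sampling random minibatches, Bellman targets computed from a frozen target network, and reward clipping. The `target_net` interface is a hypothetical placeholder, not DeepMind's actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlations in the data.
        return random.sample(self.buffer, batch_size)

def dqn_targets(batch, target_net, gamma=0.99):
    """Bellman targets computed from the frozen target network.

    `target_net(state)` is assumed to return the Q-values for all actions.
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        clipped = max(-1.0, min(1.0, reward))  # reward clipping
        bootstrap = 0.0 if done else gamma * max(target_net(next_state))
        targets.append(clipped + bootstrap)
    return targets

# Every C steps the online network's weights are copied into the target
# network, keeping the regression targets stable between copies and
# breaking the correlation between the Q-network and its target.
```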