This document provides an overview of reinforcement learning concepts. It introduces reinforcement learning as learning, from reward feedback, how to act so as to maximize utility. It describes Markov decision processes (MDPs) as the framework for modeling reinforcement learning problems, including states, actions, transitions, and rewards. It discusses solving MDPs by finding optimal policies with the value iteration or policy iteration algorithms, both based on the Bellman equations. The goal is to learn optimal state values or action values through interaction rather than relying on a known model of the environment.
2. Reinforcement Learning
Basic idea:
Receive feedback in the form of rewards
Agent’s utility is defined by the reward function
Must (learn to) act so as to maximize expected rewards
3. Grid World
The agent lives in a grid
Walls block the agent's path
The agent's actions do not always go as planned:
80% of the time, the action North takes the agent North (if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have been taken, the agent stays put
Small "living" reward each step
Big rewards come at the end
Goal: maximize sum of rewards
5. Markov Decision Processes
An MDP is defined by:
A set of states s ∈ S
A set of actions a ∈ A
A transition function T(s,a,s')
Prob that a from s leads to s', i.e., P(s' | s,a)
Also called the model
A reward function R(s, a, s')
Sometimes just R(s) or R(s')
A start state (or distribution)
Maybe a terminal state
MDPs are a family of non-deterministic search problems
Reinforcement learning: MDPs where we don't know the transition or reward functions
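A minimal sketch of what this tuple can look like in code. Only the structure (S, A, T, R, start state, terminals) comes from the slide; the class layout and the two-state toy example are illustrative assumptions.

```python
class MDP:
    """Container for the (S, A, T, R, start, terminals) tuple defined on this slide."""
    def __init__(self, states, actions, T, R, start, terminals, gamma=0.9):
        self.states = states        # set of states s in S
        self.actions = actions      # set of actions a in A
        self.T = T                  # T[(s, a)] -> list of (s', prob) pairs, i.e. P(s'|s,a)
        self.R = R                  # R[(s, a, s')] -> reward
        self.start = start          # start state (could also be a distribution)
        self.terminals = terminals  # optional terminal states
        self.gamma = gamma          # discount, introduced a few slides later

# A two-state toy example: "go" usually succeeds (0.8) but can slip back (0.2).
toy = MDP(
    states={"s0", "s1"},
    actions={"go", "stay"},
    T={("s0", "go"): [("s1", 0.8), ("s0", 0.2)], ("s0", "stay"): [("s0", 1.0)]},
    R={("s0", "go", "s1"): 1.0, ("s0", "go", "s0"): -0.1, ("s0", "stay", "s0"): -0.1},
    start="s0",
    terminals={"s1"},
)
```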
7. What is Markov about MDPs?
Andrey Markov (1856-1922)
"Markov" generally means that given the present state, the future and the past are independent
For Markov decision processes, "Markov" means:
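The condition referred to here (the slide's formula did not survive extraction) is the standard Markov property for MDPs:

$$P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, A_{t-1}=a_{t-1}, \ldots, S_0=s_0) = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)$$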
8. Solving MDPs
In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal
In an MDP, we want an optimal policy π*: S → A
A policy gives an action for each state
An optimal policy maximizes expected utility if followed
Defines a reflex agent
Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminals s
10. MDP Search Trees
Each MDP state gives an expectimax-like search tree
s is a state
(s, a) is a q-state
(s,a,s') called a transition, with T(s,a,s') = P(s'|s,a) and reward R(s,a,s')
11. Utilities of Sequences
In order to formalize optimality of a policy, need to understand utilities of sequences of rewards
Typically consider stationary preferences:
Theorem: only two ways to define stationary utilities
Additive utility:
Discounted utility:
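The two definitions named above, written out (the slide's own formulas were images; these are the standard forms):

$$U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots \qquad \text{(additive)}$$
$$U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \qquad \text{(discounted, } 0 \le \gamma \le 1\text{)}$$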
12. Infinite Utilities?!
Problem: infinite state sequences have infinite rewards
Solutions:
Finite horizon:
Terminate episodes after a fixed T steps (e.g. life)
Gives nonstationary policies (π depends on time left)
Absorbing state: guarantee that for every policy, a terminal state will eventually be reached
Discounting: for 0 < γ < 1
Smaller γ means smaller "horizon" – shorter term focus
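Why discounting works: if every reward is bounded by $R_{\max}$, the discounted return is a convergent geometric series,

$$\Big| \sum_{t=0}^{\infty} \gamma^t r_t \Big| \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}.$$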
14. Recap: Defining MDPs
Markov decision processes:
States S
Start state s_0
Actions A
Transitions P(s'|s,a) (or T(s,a,s'))
Rewards R(s,a,s') (and discount γ)
MDP quantities so far:
Policy = choice of action for each state
Utility (or return) = sum of discounted rewards
15. Optimal Utilities
Fundamental operation: compute the values (optimal expectimax utilities) of states s
Why? Optimal values define optimal policies!
Define the value of a state s:
V*(s) = expected utility starting in s and acting optimally
Define the value of a q-state (s,a):
Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
Define the optimal policy:
π*(s) = optimal action from state s
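The three quantities are linked by two standard identities (not spelled out on the slide itself):

$$V^*(s) = \max_a Q^*(s,a), \qquad \pi^*(s) = \arg\max_a Q^*(s,a).$$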
16. The Bellman Equations
Definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values:
Optimal rewards = maximize over first action and then follow optimal policy
Formally:
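The equations the slide points to are the Bellman optimality equations; in this deck's notation they read:

$$V^*(s) = \max_a Q^*(s,a)$$
$$Q^*(s,a) = \sum_{s'} T(s,a,s')\,\big[ R(s,a,s') + \gamma V^*(s') \big]$$

and, combining the two,

$$V^*(s) = \max_a \sum_{s'} T(s,a,s')\,\big[ R(s,a,s') + \gamma V^*(s') \big].$$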
17. Solving MDPs
We want to find the optimal policy π*
Proposal 1: modified expectimax search, starting from each state s
18. Why Not Search Trees?
Why not solve with expectimax?
Problems:
This tree is usually infinite (why?)
Same states appear over and over (why?)
We would search once per state (why?)
Idea: Value iteration
Compute optimal values for all states all at once using successive approximations
Will be a bottom-up dynamic program similar in cost to memoization
Do all planning offline, no replanning needed!
19. Value Estimates
Calculate estimates V_k*(s)
Not the optimal value of s!
The optimal value considering only the next k time steps (k rewards)
As k → ∞, it approaches the optimal value
Almost solution: recursion (i.e. expectimax)
Correct solution: dynamic programming
20. Value Iteration
Idea:
Start with V_0*(s) = 0, which we know is right (why?)
Given V_i*, calculate the values for all states for depth i+1:
This is called a value update or Bellman update
Repeat until convergence
Theorem: will converge to unique optimal values
Basic idea: approximations get refined towards optimal values
Policy may converge long before values do
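The value update referred to above is $V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V_i(s')]$. Below is a minimal Python sketch of that loop; the dictionary layout (T maps (s, a) to a list of (s', prob) pairs, R maps (s, a, s') to a reward, as in the earlier MDP sketch) is an illustrative assumption, not something prescribed by the slides.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeat Bellman updates over all states until values stop changing."""
    V = {s: 0.0 for s in states}                     # V_0(s) = 0 for all s
    while True:
        new_V = {}
        for s in states:
            # One-step lookahead over every applicable action; states with no
            # outgoing transitions (terminals) keep value 0.
            q_values = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions if (s, a) in T
            ]
            new_V[s] = max(q_values) if q_values else 0.0
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```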
22. Example: Value Iteration
Information propagates outward from terminal states and eventually all states have correct value estimates
Figure: grid-world value estimates V_2 and V_3
23. Convergence*
Define the max-norm:
Theorem: For any two approximations U and V
I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U and value iteration converges to a unique, stable, optimal solution
Theorem:
I.e. once the change in our approximation is small, it must also be close to correct
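The missing formal statements are the standard ones (stated here as the usual textbook forms, since the slide's own equations are not recoverable): with the max-norm $\|U\|_\infty = \max_s |U(s)|$, the Bellman update $B$ is a contraction,

$$\|BU - BV\|_\infty \le \gamma\,\|U - V\|_\infty,$$

and once successive iterates are close, the current estimate is close to the true values:

$$\|U_{i+1} - U_i\|_\infty < \epsilon(1-\gamma)/\gamma \;\Rightarrow\; \|U_{i+1} - U^*\|_\infty < \epsilon.$$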
24. Practice: Computing Actions
Which action should we choose from state s:
Given optimal values V?
Given optimal q-values Q?
Lesson: actions are easier to select from Q's!
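Spelling out the comparison the slide makes: from optimal values you still need the model to pick an action, from Q-values you do not:

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[ R(s,a,s') + \gamma V^*(s') \big] \qquad \text{(needs } T \text{ and } R\text{)}$$
$$\pi^*(s) = \arg\max_a Q^*(s,a) \qquad \text{(a model-free lookup)}$$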
25. Utilities for Fixed Policies
Another basic operation: compute the utility of a state s under a fixed (general non-optimal) policy
Define the utility of a state s, under a fixed policy π:
V^π(s) = expected total discounted rewards (return) starting in s and following π
Recursive relation (one-step lookahead / Bellman equation):
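The recursive relation on this slide (an image in the original) is the fixed-policy Bellman equation:

$$V^\pi(s) = \sum_{s'} T(s,\pi(s),s')\,\big[ R(s,\pi(s),s') + \gamma V^\pi(s') \big].$$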
26. Value Iteration
Idea:
Start with V_0*(s) = 0, which we know is right (why?)
Given V_i*, calculate the values for all states for depth i+1:
This is called a value update or Bellman update
Repeat until convergence
Theorem: will converge to unique optimal values
Basic idea: approximations get refined towards optimal values
Policy may converge long before values do
27. Policy Iteration
Problem with value iteration:
Considering all actions each iteration is slow: takes |A| times longer than policy evaluation
But policy doesn't change each iteration, time wasted
Alternative to value iteration:
Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
Step 2: Policy improvement: update policy using one-step lookahead with resulting converged (but not optimal!) utilities (slow but infrequent)
Repeat steps until policy converges
This is policy iteration
It's still optimal!
Can converge faster under some conditions
28. Policy Iteration
Policy evaluation: with fixed current policy π, find values with simplified Bellman updates:
Iterate until values converge
Policy improvement: with fixed utilities, find the best action according to one-step look-ahead
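A minimal Python sketch of the two alternating steps, using the same assumed dictionary layout as the value-iteration sketch above (an illustration, not the deck's own code):

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    """Alternate policy evaluation and one-step-lookahead policy improvement."""
    # Start from an arbitrary policy: any applicable action in each non-terminal state.
    policy = {s: next(a for a in actions if (s, a) in T)
              for s in states if any((s, a) in T for a in actions)}
    while True:
        # Step 1: policy evaluation with the simplified (no-max) Bellman update.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s, a in policy.items():
                v = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Step 2: policy improvement by one-step lookahead on the frozen values.
        stable = True
        for s in policy:
            best = max((a for a in actions if (s, a) in T),
                       key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                         for s2, p in T[(s, a)]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```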
29. Comparison
In value iteration:
Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
In policy iteration:
Several passes to update utilities with frozen policy
Occasional passes to update policies
Hybrid approaches (asynchronous policy iteration):
Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
30. Reinforcement Learning
Reinforcement learning:
Still assume an MDP:
A set of states s ∈ S
A set of actions (per state) A
A model T(s,a,s')
A reward function R(s,a,s')
Still looking for a policy π(s)
New twist: don't know T or R
i.e. don't know which states are good or what the actions do
Must actually try actions and states out to learn
31. Passive Learning
Simplified task
You don't know the transitions T(s,a,s')
You don't know the rewards R(s,a,s')
You are given a policy π(s)
Goal: learn the state values
… what policy evaluation did
In this case:
Learner "along for the ride"
No choice about what actions to take
Just execute the policy and learn from experience
We'll get to the active case soon
This is NOT offline planning! You actually take actions in the world and see what happens…
32. Example: Direct Evaluation
γ = 1, living reward R = -1; exit rewards are +100 at (4,3) and -100 at (4,2)
Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
Average the returns actually observed after each visit to a state:
V(2,3) ≈ (96 + -103) / 2 = -3.5   (96 and -103 are the total rewards seen after visiting (2,3) in Episodes 1 and 2)
V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3   (returns from the three visits to (3,3))
33. Recap: Model-Based Policy Evaluation
Simplified Bellman updates to calculate V^π for a fixed policy π:
New V is expected one-step lookahead using current V
Unfortunately, need T and R
34. Model-Based Learning
Idea:
Learn the model empirically through experience
Solve for values as if the learned model were correct
Simple empirical model learning
Count outcomes for each s,a
Normalize to give estimate of T(s,a,s’)
Discover R(s,a,s’) when we experience (s,a,s’)
Solving the MDP with the learned model
Iterative policy evaluation, for example
35. Example: Model-Based Learning
Episodes:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
γ = 1
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2
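A minimal sketch of the empirical model-learning step, using the same illustrative encoding of the two episodes (now keeping the action as well); counting and normalizing outcomes reproduces the T estimates above.

```python
from collections import Counter, defaultdict

# Transitions observed as (state, action, next_state) triples from the two episodes
transitions = [
    ((1,1), 'up', (1,2)), ((1,2), 'up', (1,2)), ((1,2), 'up', (1,3)),
    ((1,3), 'right', (2,3)), ((2,3), 'right', (3,3)), ((3,3), 'right', (3,2)),
    ((3,2), 'up', (3,3)), ((3,3), 'right', (4,3)),
    ((1,1), 'up', (1,2)), ((1,2), 'up', (1,3)), ((1,3), 'right', (2,3)),
    ((2,3), 'right', (3,3)), ((3,3), 'right', (3,2)), ((3,2), 'up', (4,2)),
]

counts = defaultdict(Counter)
for s, a, s2 in transitions:
    counts[(s, a)][s2] += 1                      # count outcomes for each (s, a)

T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}  # normalize to probabilities
         for sa, c in counts.items()}

print(T_hat[((3,3), 'right')])   # {(3,2): 2/3, (4,3): 1/3}
print(T_hat[((2,3), 'right')])   # {(3,3): 1.0}
```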
36. Model-Free Learning
Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
Model-based: estimate P(x) from samples, then compute the expectation with the estimated P
Model-free: estimate the expectation directly from samples: E[f(x)] ≈ (1/N) Σ_i f(x_i), with x_i ~ P
Why does this work? Because samples appear with the right frequencies!
37. Sample-Based Policy Evaluation?
Who needs T and R? Approximate the
expectation with samples (drawn from T!)
Take the action π(s) repeatedly, observe sampled successors s1', s2', … and form samples:
sample_k = R(s, π(s), sk') + γ Vπi(sk')
Average them: Vπi+1(s) ← (1/n) Σ_k sample_k
Almost! But we only
actually make progress
when we move to i+1.
38. Temporal-Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience (s,a,s’,r)
Likely s’ will contribute updates more often
Temporal difference learning
Policy still fixed!
Move values toward value of whatever
successor occurs: running average!
Sample of V(s): sample = R(s, π(s), s') + γ Vπ(s')
Update to V(s): Vπ(s) ← (1 − α) Vπ(s) + α · sample
Same update: Vπ(s) ← Vπ(s) + α (sample − Vπ(s))
39. Exponential Moving Average
Exponential moving average (the running interpolation update):
x̄_n = (1 − α) · x̄_{n−1} + α · x_n
Makes recent samples more important
Forgets about the past (distant past values were wrong anyway)
Easy to compute from the running average
Decreasing the learning rate α over time can give converging averages
40. Example: TD Policy Evaluation
Take γ = 1, α = 0.5
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
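A small TD(0) sketch in Python showing the running-average update with γ = 1 and α = 0.5 on the two episodes above; the (state, reward, next_state) encoding is an assumption for the example, not from the slides.

```python
from collections import defaultdict

gamma, alpha = 1.0, 0.5

# (state, reward, next_state) samples in the order they were experienced;
# next_state None marks the end of an episode (value 0 afterwards)
experience = [
    ((1,1), -1, (1,2)), ((1,2), -1, (1,2)), ((1,2), -1, (1,3)),
    ((1,3), -1, (2,3)), ((2,3), -1, (3,3)), ((3,3), -1, (3,2)),
    ((3,2), -1, (3,3)), ((3,3), -1, (4,3)), ((4,3), 100, None),
    ((1,1), -1, (1,2)), ((1,2), -1, (1,3)), ((1,3), -1, (2,3)),
    ((2,3), -1, (3,3)), ((3,3), -1, (3,2)), ((3,2), -1, (4,2)),
    ((4,2), -100, None),
]

V = defaultdict(float)                      # all values start at 0
for s, r, s2 in experience:
    sample = r + gamma * (V[s2] if s2 is not None else 0.0)
    V[s] = (1 - alpha) * V[s] + alpha * sample   # move V(s) toward the observed sample
```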
41. Problems with TD Value Learning
TD value learning is a model-free way
to do policy evaluation
However, if we want to turn values into a (new) policy, we’re sunk:
π(s) = argmax_a Q(s,a), but computing Q(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V(s') ] still needs T and R
Idea: learn Q-values directly
Makes action selection model-free too!
42. Active Learning
Full reinforcement learning
You don’t know the transitions T(s,a,s’)
You don’t know the rewards R(s,a,s’)
You can choose any actions you like
Goal: learn the optimal policy
… what value iteration did!
In this case:
Learner makes choices!
Fundamental tradeoff: exploration vs. exploitation
This is NOT offline planning! You actually take actions in the
world and find out what happens…
43. The Story So Far: MDPs and RL
Things we know how to do → Techniques:
If we know the MDP (model-based DPs):
Compute V*, Q*, π* exactly → value iteration and policy iteration
Evaluate a fixed policy π → policy evaluation
If we don’t know the MDP:
We can estimate the MDP, then solve it → model-based RL
We can estimate Vπ for a fixed policy π → model-free RL: value learning
We can estimate Q*(s,a) for the optimal policy while executing an exploration policy → Q-learning
44. Q-Learning
Q-Learning: sample-based Q-value iteration
Learn Q*(s,a) values
Receive a sample (s,a,s’,r)
Consider your old estimate: Q(s,a)
Consider your new sample estimate: sample = r + γ max_a' Q(s',a')
Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample
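A tabular Q-learning update sketched in Python; the Q-table keyed by (state, action) and the helper names are illustrative assumptions consistent with the update above.

```python
from collections import defaultdict

Q = defaultdict(float)      # Q(s,a) table, defaults to 0

def q_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single sample (s, a, s', r)."""
    # New sample estimate: reward plus discounted value of the best next action
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    # Running average toward the sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```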
45. Q-Learning Properties
Amazing result: Q-learning converges to optimal policy
If you explore enough
If you make the learning rate small enough
… but not decrease it too quickly!
Basically, in the limit, it doesn’t matter how you select actions (!)
Neat property: off-policy learning
learn optimal policy without following it (some caveats)
46. Exploration / Exploitation
Several schemes for forcing exploration
Simplest: random actions (ε-greedy)
Every time step, flip a coin
With probability ε, act randomly
With probability 1−ε, act according to the current policy
Problems with random actions?
You do explore the space, but keep thrashing
around once learning is done
One solution: lower ε over time
Another solution: exploration functions
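A minimal ε-greedy action selector in Python (illustrative), reusing the hypothetical Q-table from the Q-learning sketch above.

```python
import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit current estimates
```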
47. Exploration Functions
When to explore
Random actions: explore a fixed amount
Better idea: explore areas whose badness is not (yet)
established
Exploration function
Takes a value estimate u and a visit count n, and returns an optimistic
utility, e.g. f(u, n) = u + k / n (exact form not important)
49. Q-Learning
In realistic situations, we cannot possibly learn
about every single state!
Too many states to visit them all in training
Too many states to hold the q-tables in memory
Instead, we want to generalize:
Learn about some small number of training states
from experience
Generalize that experience to new, similar states
This is a fundamental idea in machine learning, and
we’ll see it over and over again
50. Example: Pacman
Let’s say we discover
through experience
that this state is bad:
In naïve q-learning, we know nothing about this state or its q-states:
Or even this one!
51. Feature-Based Representations
Solution: describe a state using
a vector of features
Features are functions from states
to real numbers (often 0/1) that
capture important properties of the
state
Example features:
Distance to closest ghost
Distance to closest dot
Number of ghosts
1 / (dist to dot)²
Is Pacman in a tunnel? (0/1)
…… etc.
Can also describe a q-state (s, a)
with features (e.g. action moves
closer to food)
52. Linear Feature Functions
Using a feature representation, we can write a
q function (or value function) for any state
using a few weights:
V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
Advantage: our experience is summed up in a
few powerful numbers
Disadvantage: states may share features but
be very different in value!
53. Function Approximation
Q-learning with linear q-functions:
difference = [ r + γ max_a' Q(s',a') ] − Q(s,a)
Q(s,a) ← Q(s,a) + α · difference (exact Q’s)
wi ← wi + α · difference · fi(s,a) (approximate Q’s)
Intuitive interpretation:
Adjust weights of active features
E.g. if something unexpectedly bad happens, disprefer all states
with that state’s features
Formal justification: online least squares
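An approximate Q-learning sketch with a linear q-function in Python; the feature extractor and its return format are hypothetical, chosen only to illustrate the weight update above.

```python
def linear_q(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as a {name: value} dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(weights, feat_fn, s, a, r, s_next, next_actions,
                    alpha=0.01, gamma=0.9):
    """One approximate Q-learning update: adjust the weights of active features."""
    feats = feat_fn(s, a)
    q_sa = linear_q(weights, feats)
    q_next = max((linear_q(weights, feat_fn(s_next, a2)) for a2 in next_actions),
                 default=0.0)
    difference = (r + gamma * q_next) - q_sa
    for name, value in feats.items():        # active features get nudged by the TD error
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```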
56. Policy Search
Problem: often the feature-based policies that work well
aren’t the ones that approximate V / Q best
E.g. your value functions from project 2 were probably horrible
estimates of future rewards, but they still produced good
decisions
We’ll see this distinction between modeling and prediction again
later in the course
Solution: learn the policy that maximizes rewards rather
than the value that predicts rewards
This is the idea behind policy search, such as what
controlled the upside-down helicopter
57. Policy Search
Simplest policy search:
Start with an initial linear value function or q-function
Nudge each feature weight up and down and see if
your policy is better than before
Problems:
How do we tell the policy got better?
Need to run many sample episodes!
If there are a lot of features, this can be impractical
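A crude sketch of the “nudge each weight” policy search described above; evaluate_policy is a hypothetical helper that runs many sample episodes and returns average reward, which is exactly the expensive step these bullets warn about.

```python
def hill_climb_policy_search(weights, evaluate_policy, step=0.05, iters=100):
    """Naive policy search: perturb one weight at a time, keep changes that help."""
    best_score = evaluate_policy(weights)          # requires running many sample episodes
    for _ in range(iters):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta           # nudge this feature weight up or down
                score = evaluate_policy(candidate)
                if score > best_score:             # keep the nudge only if the policy improved
                    weights, best_score = candidate, score
    return weights, best_score
```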
58. Policy Search*
Advanced policy search:
Write a stochastic (soft) policy, e.g. a softmax over the q-values:
πw(a|s) = exp(Qw(s,a)) / Σ_a' exp(Qw(s,a'))
Turns out you can efficiently approximate the
derivative of the returns with respect to the
parameters w (details in the book, but you don’t have
to know them)
Take uphill steps, recalculate derivatives, etc.