The document outlines the author's work on reinforcement learning (RL). It begins with a one-page summary of the author's contributions to prediction, control, and RL applications. It then provides background on Markov decision processes (MDPs), including definitions, value functions, policies, and dynamic programming methods. Finally, it discusses key aspects of RL, such as sample-based learning from interaction and the importance of sample and memory efficiency, covering algorithms like TD learning and Q-learning.
1. My RL Approach to Prediction and Control
Hengshuai Yao
University of Alberta
April 4, 2013
2. Outline
• One-page Summary of my work (30 seconds)
• Background on Reinforcement Learning (RL) (8 slides; 6 minutes)
• A Walkthrough of My work (6 slides; 4 minutes)
• LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes)
• Citation count prediction using RL (10 slides; 8 minutes)
3. Summary of My Work
Prediction
• A framework for existing prediction algorithms [ICML 08]
• Data efficiency for on-policy prediction (multi-step linear Dyna [NIPS 09]) and off-policy prediction (LAM off-policy [ICML-workshop 10])
Control
• Memory and sample efficiency for control (LAM-API [AAAI 12])
• Online abstract planning with Universal Option Models [in preparation for JAIR with Csaba, Rich, and Shalabh]
• RL with general function approximation; deterministic MDPs [in preparation for the Machine Learning Journal with Csaba]
RL applications
• Citation count prediction [submitted to IEEE Trans. on SMC-part B]
• Ranking [current work with Dale]
4. Background on RL
I will go over:
• MDPs: definition, policies, value functions, and more
• Prediction problems (TD, Dyna, on-policy, off-policy)
• The control problem (value iteration, Q-learning, LSPI)
5. MDPs
An MDP is defined by a 5-tuple $(\gamma, S, A, (P^a)_{a \in A}, (R^a)_{a \in A})$, where
\[ P^a(s' \mid s) = P_0(s' \mid s, a), \qquad R^a : S \times S \to \mathbb{R}. \]
$(P^a)_{a \in A}$ and $(R^a)_{a \in A}$ are called the model, or the transition dynamics.
A policy, $\pi : S \times A \to [0, 1]$, selects actions at states. Think of a policy as a description of how you act every day.
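As a concrete illustration, here is a minimal Python sketch of this 5-tuple; the class and names are hypothetical, not from the slides:

```python
import numpy as np

class MDP:
    """The 5-tuple (gamma, S, A, (P^a), (R^a)) with tabular states and actions."""
    def __init__(self, gamma, n_states, n_actions, P, R):
        # P[a][s, s'] = P_0(s' | s, a); R[a][s, s'] = R^a(s, s')
        self.gamma = gamma
        self.states = range(n_states)
        self.actions = range(n_actions)
        self.P, self.R = P, R

def sample_action(pi, s, rng=None):
    """A policy pi: S x A -> [0, 1], stored as an (n_states x n_actions) matrix."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(pi[s]), p=pi[s])
```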
6. My MDP Example
(The slide shows a transition diagram starting from the University of Alberta at $t = 0$; its recoverable details are summarized below.)
• $S$: UofA, Edmonton Airport, HK Airport, Paomadi, Noah.
• $A$: the set of links between locations.
• $(P^a)_{a \in A}$: deterministic.
• $R^a(s, s') = r(s')$: reaching Edmonton Airport at $t = 1$ gives $r = -\$100$; reaching HK Airport at $t = 2$ gives $r = -\$1{,}000$; at $t = 3$, Noah gives $r = \$10{,}000$ and Paomadi gives $r = r_{horse}$.
• The policy $\pi$: $\pi(UofA, Edmonton) = 1$, $\pi(HK, Noah) = 0.9$, $\pi(HK, Paomadi) = 0.1$, etc.
7. Value Function
\[ V^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \;\middle|\; s_0 = s,\; a_t \sim \pi(s_t, \cdot) \right] \]
Optimal policy:
\[ V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s), \quad \text{for all } s \in S. \]
In the example, under $\pi$ with $r_{horse} = -1{,}000$:
\[ V^\pi(UofA) = -100 - 1000\gamma + \gamma^2 \big( 0.1 \times (-1{,}000) + 0.9 \times 10{,}000 \big). \]
If $r_{horse} = -1{,}000$, the optimal policy always chooses Noah:
\[ V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big( \underbrace{1.0}_{=\pi^*(HK, Noah)} \times 10{,}000 \big). \]
If $r_{horse} = 1{,}000{,}000$, it always chooses Paomadi:
\[ V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big( \underbrace{1.0}_{=\pi^*(HK, Paomadi)} \times 1{,}000{,}000 \big). \]
8. MDPs cont'd: Dynamic Programming
Bellman equation. For all $s \in S$ and any policy $\pi$, a one-step look-ahead gives
\[ V^\pi(s) = \bar{r}(s) + \gamma E_{S' \sim \pi(s, \cdot)}[V^\pi(S')], \]
where $\bar{r}(s) = \sum_{s'} P^\pi(s, s')\, r(s, s')$ and $r(s, s') = \sum_{a \in A} P^a(s, s') R^a(s, s')$.
Solving $V^\pi$ for an ordinary policy $\pi$ is called policy evaluation. Simple: power iteration.
Solving $V^{\pi^*}$ or $\pi^*$ is called control, usually via value iteration:
\[ V_{k+1}(s) = \max_a E[r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a] = \max_a \sum_{s'} P^a(s' \mid s) \big( R^a(s, s') + \gamma V_k(s') \big). \]
Policy iteration is similar.
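A minimal sketch of the value-iteration update above, assuming the tabular MDP arrays from the earlier sketch (illustrative, not the thesis code):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """P[a][s, s'] = P^a(s'|s); R[a][s, s'] = R^a(s, s')."""
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Q[a, s] = sum_{s'} P^a(s'|s) * (R^a(s, s') + gamma * V(s'))
        Q = np.array([(Pa * (Ra + gamma * V[None, :])).sum(axis=1)
                      for Pa, Ra in zip(P, R)])
        V_new = Q.max(axis=0)               # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # value function and a greedy policy
        V = V_new
    return V, Q.argmax(axis=0)
```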
9. RL
Features of RL:
• Sample-based learning; no model.
• Only intermediate rewards are observed.
• Possibly partially observable, e.g., citation count prediction.
Prediction/Control: solving $V^\pi$ (for some $\pi$) or $V^*$ using samples.
Sample and memory efficiency are important. Algorithms:
• TD, Q-learning [Barto et al. 83; Sutton 88; Dayan 92; Bertsekas 96]
• Dyna [Sutton et al. 91] and linear Dyna [Sutton et al. 08]
• LSTD [Boyan 02], LSPI [Lagoudakis and Parr 03]
• GTD [Sutton et al. 09; Maei et al. 10]
10. Prediction
Feature mapping: $\phi : S \to \mathbb{R}^n$, with $n$ the number of features.
Linear Function Approximation (LFA). We approximate $V^\pi$ by
\[ \hat{V}^\pi(s) = \phi(s)^\top \theta, \]
for $s \in S$, where $\theta$ is the parameter vector (to learn).
Samples (this is our Big Data):
\[ D = \big( \langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle \big)_{t=1,2,\ldots,T}. \]
Prediction: given an input policy $\pi$, output an estimate of the value function $V^\pi$. We learn a predictor on $D$ using $\phi$.
On-policy: $D$ is created by following $\pi$. Off-policy: $D$ is not created by $\pi$.
11. Prediction cont'd: TD (Sutton, 88)
Given the tuple $\langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle$, Temporal Difference (TD) learning (without eligibility traces) computes
\[ \delta_t = r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t). \]
$\delta_t$ is called the TD error; it is a sample of the Bellman residual:
\[ E[\delta_t \mid s_t = s] = \bar{r}(s) + \gamma \sum_{s' \in S} P^\pi(s, s') \hat{V}^\pi(s') - \hat{V}^\pi(s). \]
The update is
\[ \Delta\theta_t \propto \alpha_t \delta_t \phi(s_t). \]
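A minimal sketch of one such TD(0) update with linear function approximation, matching the formulas above (the function name is mine):

```python
import numpy as np

def td0_update(theta, phi_s, r_next, phi_s_next, gamma, alpha):
    """One TD(0) step: theta <- theta + alpha * delta * phi(s_t)."""
    delta = r_next + gamma * (phi_s_next @ theta) - (phi_s @ theta)  # TD error
    return theta + alpha * delta * phi_s
```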
12. Preconditioning Framework [ICML 08]
Previously: issues of step size, sample efficiency, and sparsity in LSTD [Boyan 02], LSPE [Bertsekas et al. 96, 03, 04], FPKF [Van Roy 06], and iLSTD [Geramifard et al. 06, 07].
Contribution: I proposed a general class of algorithms by applying the preconditioning technique from iterative analysis; the class includes all of the algorithms above, and I addressed all of these issues within the framework. Empirical results: step-size adaptation learns much more quickly, and sparsity-based storage and computation is more efficient.
13. Multi-step Linear Dyna [NIPS 09]
Previously: online planning is believed to speed up learning and lead to better decisions (mostly in tabular settings), yet "model-based is poorer than model-free". Dyna [Sutton et al. 92] and linear Dyna [Sutton et al. 08] form an integrated architecture for real-time acting, learning, modeling, and planning, with no component waiting for the others to complete. However, linear Dyna was found to perform worse than (model-free) TD learning [Sutton et al. 08].
Contribution: I improved linear Dyna [Sutton et al. 08] to perform much better than TD. I also extended linear Dyna from single-step to multi-step planning, and demonstrated on Mountain Car (an under-powered car climbing a mountain) that multi-step planning predicts more accurately than single-step planning.
14. LAM-based Off-policy Learning [ICML-workshop 10]
Previously: off-policy learning is ubiquitous. TD can diverge, though it is reasonably fast when it does converge. GTD algorithms [Sutton et al. 09, 10, 11] converge but are slow.
Contribution: I used a linear action model (LAM) based framework for off-policy learning. It can learn various policies in parallel from a single stream of data, enabling quick real-time decision making. It was evaluated on two continuous-state, hard control problems. I recommend using LAM-based off-policy learning in place of on-policy learning.
15. Deterministic MDPs [with Csaba]
Previously. Theory: state aggregation and LFA; practice: LFA and neural networks.
Contribution: a very general framework for RL with function approximation. We propose to view all RL methods as building some correspondence MDP, which has a smaller state space than the original. We solve the correspondence MDP and lift the policies and value functions found there back to the original MDP. A number of important results are proved (20+ lemmas and theorems). This framework is helpful both for understanding existing algorithms and for developing new ones.
16. Reinforcement Ranking [with Dale]
Does the Bellman equation look familiar? PageRank? Stationary distribution?
Contribution: we proposed a framework for discovering authorities using rewards defined on pages/links. It involves no stationary distribution at all, yet is still guaranteed to converge. Evaluation was performed on Wikipedia, DBLP, and WebSpamUK, comparing precision and recall with PageRank and TrustRank, with promising results on Wikipedia and DBLP.
17. Universal Option Models [with R.C.S]
Previously: options are used to describe high-level decisions; the execution of an option is a sequence of actions (abstraction). Traditional option models consist of a reward part and a state-prediction part, which is very inefficient with multiple reward functions (or when the reward function changes dynamically).
Contribution: we proposed a new way of modeling options: removing the reward part and adding a state-occupancy part. We proved that (a) given any reward function, we can construct the return of the option from the new model; and (b) with the new model we can recover the TD solution without any planning computation. On a simulated StarCraft 2 game, the new model is much more efficient for planning than the traditional one, making it very suitable for large real-time games.
18. LAM-API [AAAI 12]
Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03]) have to remember the sample set $D$ and sweep all the samples in each iteration; $D$ can be very large in practice.
Key idea: summarize your Big Data with a model, then work with the model.
First, we learn a linear model $\langle F^a, f^a \rangle$ for each action $a$ from the samples. For a given action $a$ and any given state $s \in S$, with $s' \sim P^a(s, \cdot)$, we expect that
\[ F^a \phi(s) \approx E[\phi(s')] \quad \text{and} \quad (f^a)^\top \phi(s) \approx E[R^a(s, s')]. \]
Complexity of modeling: $O(Tn^2)$.
Second, we use all the LAMs to perform API. Complexity: $O(|L| n^2 N_{iter})$.
LAM-API: $O(Tn^2) + O(|L| n^2 N_{iter})$; LSPI: $O(T n^2 N_{iter})$.
Big Data means $T \gg |L|$, so LAM-API's $O(Tn^2)$ is far below LSPI's $O(Tn^2 N_{iter})$.
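One natural way to fit such a LAM is per-action least squares over the samples; the sketch below is an assumed implementation, not the thesis code:

```python
import numpy as np

def learn_lam(samples, n_features, n_actions, reg=1e-6):
    """Fit F[a] with F[a] @ phi(s) ~ E[phi(s')] and f[a] with f[a] @ phi(s) ~ E[R^a(s, s')].

    samples: iterable of (phi_s, a, r_next, phi_s_next) tuples.
    """
    G = [reg * np.eye(n_features) for _ in range(n_actions)]            # sum phi phi^T
    B = [np.zeros((n_features, n_features)) for _ in range(n_actions)]  # sum phi' phi^T
    c = [np.zeros(n_features) for _ in range(n_actions)]                # sum r phi
    for phi_s, a, r_next, phi_s_next in samples:
        G[a] += np.outer(phi_s, phi_s)
        B[a] += np.outer(phi_s_next, phi_s)
        c[a] += r_next * phi_s
    F = [B[a] @ np.linalg.inv(G[a]) for a in range(n_actions)]   # least-squares F^a
    f = [np.linalg.solve(G[a], c[a]) for a in range(n_actions)]  # least-squares f^a
    return F, f
```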
19. LAM-API cont'd
Algorithm 1: LAM-API with LSTD
Input: a list of features $L = \{\phi_i\}$ and a LAM $(\langle F^a, f^a \rangle)_{a \in A}$. Output: a weight vector $\theta$.
Initialize $\theta$.
Repeat for $N_{iter}$ iterations:
  for each $\phi_i$ from $L$ do
    Select the greedy action: $a^* = \arg\max_a \{ (f^a)^\top \phi_i + \gamma \theta^\top F^a \phi_i \}$
    Select the model: $F^* = F^{a^*}$, $f^* = f^{a^*}$
    Predict the next feature vector and reward: $\tilde{\phi}_{i+1} = F^* \phi_i$, $\tilde{r}_i = (f^*)^\top \phi_i$
    Accumulate the LSTD structures: $A \leftarrow A + \phi_i (\gamma \tilde{\phi}_{i+1} - \phi_i)^\top$, $b \leftarrow b + \phi_i \tilde{r}_i$
  $\theta = -A^{-1} b$
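A Python rendering of Algorithm 1, under the assumed reading that $A$ and $b$ are re-accumulated in each outer iteration:

```python
import numpy as np

def lam_api_lstd(features, F, f, gamma, n_iter=10):
    """LAM-API with LSTD: features is the list L of phi_i vectors; F, f form the LAM."""
    n = features[0].shape[0]
    theta = np.zeros(n)
    for _ in range(n_iter):
        A, b = np.zeros((n, n)), np.zeros(n)  # assumed reset per iteration
        for phi in features:
            # Greedy action under the current theta, evaluated through the model.
            scores = [fa @ phi + gamma * theta @ (Fa @ phi) for Fa, fa in zip(F, f)]
            a_star = int(np.argmax(scores))
            phi_next = F[a_star] @ phi        # predicted next feature vector
            r_tilde = f[a_star] @ phi         # predicted reward
            A += np.outer(phi, gamma * phi_next - phi)
            b += phi * r_tilde
        theta = -np.linalg.solve(A, b)        # theta = -A^{-1} b
    return theta
```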
20. LAM-API cont'd
We compared learning quality with LSPI, using $L = \{\phi_i \mid \phi_i \leftarrow D_i\}$, on chain-walk problems (left: the 4-state chain; right: LAM-LSTD on the 50-state chain).
LAM-LSTD converges in 4 iterations; at iteration 2, the policy is already optimal.
(Figure: value functions of LAM-LSTD on the 50-state chain over iterations. Iteration 0: all actions are 'R'. Iteration 1: RRRRRRRRRRRLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLL. Iteration 2: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL.)
21. LAM-API cont'd
LSPI converges in 14 iterations and finds the optimal policy only at the last iteration.
(Figure: LSPI value functions and policies on the 50-state chain at iterations 0, 1, 7 and 8, 9, 14. Iteration 0: all actions are 'R'. Iteration 14: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL.)
22. LAM-API cont'd: Cart-Pole
Goal: keep the pendulum above horizontal (for a maximum of 3000 steps).
Reward: binary. State: angle and angular velocity (both continuous).
(Figure: cart-pole balancing; number of balanced steps versus number of training episodes, comparing the average, best, and worst runs of LAM-LSTD and LSPI.)
Why is this important? LSPI [Lagoudakis and Parr 03] is widely used: "LSPI is arguably the most competitive RL algorithm available in large environments." [Li et al. 09]
23. Citation Count Prediction
Citation count is the most used measure in academics, and predicting it is interesting. We studied the prediction of the citation count of papers.
Previously, [Yan et al. 11] and [Fu 08] studied a citation count prediction problem using supervised learning (SL).
Training (spatial): input $x$, feature vectors in 1990 → output $y$, citation counts until 2000.
Given a paper's features in 1990, predict. Now given a paper's features in 2000: what then? (a temporal aspect)
24. Citation Count Prediction cont'd
Citation count prediction is temporal.
Problem formulation. Define the "value" of a paper $p$ at year $t$ as the sum of the discounted numbers of citations in all subsequent years:
\[ V(p, t) = \sum_{q=t}^{\infty} \gamma^{q-t} c_q, \quad \gamma \in [0, 1), \]
where $c_q$ is the number of citations the paper receives in year $q$. When $t$ is the publication year of the paper and $\gamma$ approaches one, $V(p, t)$ is virtually the paper's total citation count.
Question: what is the state here? $s_t = (p, t)$.
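A tiny sketch of this value for a finite citation series (truncating the infinite sum; illustrative only):

```python
def paper_value(citations, gamma=0.95):
    """V(p, t) for a citation series c_t, c_{t+1}, ..., truncated at the series end."""
    return sum(gamma ** i * c for i, c in enumerate(citations))

# For example, a paper cited 5, 8, and 12 times in three consecutive years:
# paper_value([5, 8, 12]) == 5 + 0.95 * 8 + 0.95**2 * 12
```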
25. Citation Count Prediction cont'd
We represent the value by
\[ \hat{V}(p, t) = \phi(p, t)^\top \theta. \]
Samples: a data set
\[ D = \bigcup_{p \in P} D_p, \quad D_p = \big( \langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle \big)_{t=1990,1991,\ldots,2000}. \]
Features. $\phi(p, t)$ is a vector with entries for, e.g., the number of citations of each author up to year $t$, the number of citations of the venue up to year $t$, etc.
26. Citation Count Prediction cont'd
Long term: predict more than 10 years ahead, using TD and LSTD.
Short term: predict within $k$ ($k < 10$) years.
• Not a standard RL problem.
• We extended the LAM to this context and proposed a model-based prediction method (see the sketch below).
Key idea: learn a model $\langle F, f \rangle$ from the year-to-year status changes of papers. Given $\langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle$, update
\[ \Delta F = \alpha \big[ \phi(p, t+1) - F \phi(p, t) \big] \phi(p, t)^\top \]
and
\[ \Delta f = \alpha \big( c_{t+1} - f^\top \phi(p, t) \big) \phi(p, t). \]
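These two updates, written out as a minimal sketch (names are mine):

```python
import numpy as np

def lam_year_update(F, f, phi_t, c_next, phi_next, alpha):
    """One stochastic update of the linear transition model F and reward model f."""
    F = F + alpha * np.outer(phi_next - F @ phi_t, phi_t)  # Delta F
    f = f + alpha * (c_next - f @ phi_t) * phi_t           # Delta f
    return F, f
```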
27. Citation Count Prediction cont'd
What have we learned?
• $f$: a one-year predictor.
• $F$: multiple one-year predictors.
(Figure: the linear transient model maps a paper's features in 2000, e.g., the citation counts of the first and last authors, the paper's own citation counts in the last few years, and the citation counts of the citing papers, to the same features in 2001.)
28. Citation Count Prediction cont'd
How do we use the model?
Given: the feature vector of a paper $s$ at year $t = 2012$, $\phi(s_{2012})$.
Citation count in 2013: $\hat{c}_1 = f^\top \phi(s_{2012})$.
Citation count in 2014: we need $\phi(s_{2013})$, which is unavailable. We can predict the features, $\hat{\phi}_{2013} \stackrel{\text{def}}{=} F \phi(s_{2012})$, and then apply $f$ again to predict
\[ \hat{c}_2 = f^\top \hat{\phi}_{2013}. \]
Using a prediction to predict: this generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and LAM-API to more general multi-step prediction problems. Similarly, we can extrapolate more years into the future.
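Rolling the model forward in this way gives a $k$-year predictor; a minimal sketch (illustrative names):

```python
import numpy as np

def predict_citations(F, f, phi_now, k):
    """Predict citation counts for each of the next k years by iterating the model."""
    preds, phi = [], np.asarray(phi_now, dtype=float)
    for _ in range(k):
        preds.append(f @ phi)  # one-year citation prediction from current features
        phi = F @ phi          # predicted features for the following year
    return preds
```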
29. Citation Count Prediction: Empirical
"Now" is 2002.
Training data: the citation counts of 7K papers from 1990 to "Now".
Test data: their citation counts from "Now" to 2012.
Algorithms: LS/SVR, LSTD.
(Figure: predicted versus true citation counts for least squares (LS) on the training data and when predicting the future.)
30. Citation Count Prediction: Empirical
LSTD successfully generalizes over time for the training papers.
(Figure: LSTD predicted versus true citation counts.)
31. Citation Count Prediction: Empirical
Predicting for test papers (newer than the training papers).
Papers are marked according to year of publication (black, green, blue, red, magenta):
• Left plot: papers published in 1990, 1991, 1992, 1993, 1994.
• Middle plot: papers published in 1995, 1996, 1997, 1998, 1999.
• Right plot: black and green are papers published in 2000 and 2001.
True citation counts are marked with crosses (+); predictions with stars (*).
LSTD successfully generalizes over time for new papers.
(Figure: LSTD(0) predicted versus true citation counts for papers from 1990-94, 1995-99, and 2000-01.)
32. Citation Count Prediction: Empirical
Short-term prediction: the performance of the proposed method.
(Figure: left, 4-year Dyna predictions versus true values for papers from 2000-01; right, RMSE over 1 to 8 years into the future, grouped by publication period: before 1990, 1990-1995, 1995-2000, and 2000-2001.)
33. Thank You!