My RL Approach to Prediction and Control

Hengshuai Yao
University of Alberta

April 4, 2013



Outline

        • One-page Summary of my work (30 seconds)
        • Background on Reinforcement Learning (RL) (8 slides; 6 minutes)
        • A walkthrough of my work (6 slides; 4 minutes)
        • LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes)
        • Citation count prediction using RL (10 slides; 8 minutes)




Summary of my work
   Prediction
        • A framework for existing prediction algorithms [ICML 08]
        • Data efficiency for on-policy prediction (multi-step linear Dyna [NIPS 09]) and
          off-policy prediction (LAM-off-policy [ICML-workshop 10])
   Control
        • Memory and sample efficiency for control (LAM-API [AAAI 12])
        • Online abstract planning with Universal Option Models [in preparation for JAIR
          with Csaba, Rich and Shalabh]
        • RL with general function approximation; deterministic MDPs [in preparation for
          the Machine Learning Journal with Csaba]
   RL applications
        • Citation count prediction [submitted to IEEE Trans. on SMC-Part B]
        • Ranking [current work with Dale]


Background on RL


   I will go over
        • MDPs: Definition, Policies, Value Functions, and more
        • Prediction Problems (TD, Dyna, On-policy, Off-policy)
        • The Control Problem (Value Iteration, Q-learning, LSPI)




MDPs
   An MDP is defined by a 5-tuple $(\gamma, S, A, (P^a)_{a \in A}, (R^a)_{a \in A})$, where

   $$P^a(s'|s) = P_0(s'|s, a), \qquad R^a : S \times S \to \mathbb{R}.$$

   $(P^a)_{a \in A}$ and $(R^a)_{a \in A}$ are called the model, or the transition dynamics.
   A policy $\pi : S \times A \to [0, 1]$ selects actions at states. Think of a
   policy as a description of how you act every day.



My MDP example

   [Figure: a small travel MDP. From the University of Alberta (t=0), policy π flies
   to the Edmonton Airport with probability 1.0 (t=1, r=$-100), then to the HK Airport
   with probability 1.0 (t=2, r=$-1,000). From HK it goes to Noah with probability 0.9
   (t=3, r=$10,000) or to Paomadi with probability 0.1 (t=3, r = r_horse).]

   S: UofA, EDM, HK, Paomadi, Noah
   A: the set of the links
   (P^a)_{a ∈ A}: deterministic
   R^a(s, s') = r(s')
   π(UA, Edmonton) = 1, π(HK, Noah) = 0.9, π(HK, Paomadi) = 0.1; etc.
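   To make the example concrete, here is a minimal Python sketch that encodes the
   same toy MDP as plain dictionaries; the action names and the concrete value of
   r_horse are my own illustration, only the states, rewards, and probabilities
   come from the slide.

    # A minimal encoding of the example MDP (names and the concrete
    # r_horse value are illustrative, not from the thesis code).
    r_horse = -1000          # reward for ending up at Paomadi (horse racing)

    states = ["UofA", "EDM", "HK", "Paomadi", "Noah"]

    # P[state][action] -> list of (next_state, probability); the actions are the links.
    P = {
        "UofA": {"fly_EDM": [("EDM", 1.0)]},
        "EDM":  {"fly_HK":  [("HK", 1.0)]},
        "HK":   {"go_Noah": [("Noah", 1.0)], "go_Paomadi": [("Paomadi", 1.0)]},
    }

    # Rewards depend only on the next state: R^a(s, s') = r(s').
    r = {"UofA": 0, "EDM": -100, "HK": -1000, "Noah": 10000, "Paomadi": r_horse}

    # The policy from the slide: deterministic until HK, then 0.9 / 0.1.
    pi = {
        "UofA": {"fly_EDM": 1.0},
        "EDM":  {"fly_HK": 1.0},
        "HK":   {"go_Noah": 0.9, "go_Paomadi": 0.1},
    }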





Value function

   $$V^\pi(s) = E\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0 = s,\ a_t \sim \pi(s_t, \cdot)\Big]$$

   Optimal policy

   $$V^*(s) = V^{\pi^*}(s) = \max_{\pi} V^\pi(s), \quad \text{for all } s \in S.$$

   $$V^\pi(UofA) = -100 - 1000\gamma + \gamma^2 \big(0.1 \times (-1000) + 0.9 \times 10{,}000\big)$$

   If $r_{horse} = -1{,}000$:

   $$V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big(\underbrace{1.0}_{=\pi^*(HK,\,Noah)} \times 10{,}000\big)$$

   If $r_{horse} = 1{,}000{,}000$:

   $$V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big(\underbrace{1.0}_{=\pi^*(HK,\,Paomadi)} \times 1{,}000{,}000\big)$$
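   A small sanity-check sketch that evaluates the closed-form expressions above for
   the example; γ is left as a free parameter and the function names are only
   illustrative.

    # V^pi(UofA) = -100 - 1000*gamma + gamma^2 * (0.1*r_horse + 0.9*10000)
    def v_uofa(gamma, r_horse):
        return -100.0 - 1000.0 * gamma + gamma ** 2 * (0.1 * r_horse + 0.9 * 10000.0)

    def v_star_uofa(gamma, r_horse):
        # The optimal policy puts all probability on the better HK action.
        return -100.0 - 1000.0 * gamma + gamma ** 2 * max(10000.0, r_horse)

    if __name__ == "__main__":
        for rh in (-1000.0, 1_000_000.0):
            print(rh, v_uofa(0.9, rh), v_star_uofa(0.9, rh))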

MDPs cont’–Dynamic programming
   Bellman equation. For all s ∈ S and any policy π, one-step look-ahead:

   $$V^\pi(s) = \bar{r}(s) + \gamma E_{S' \sim \pi(s, \cdot)}[V^\pi(S')],$$

   where $\bar{r}(s) = \sum_{s'} P^\pi(s, s')\, r(s, s')$ and $r(s, s') = \sum_{a \in A} P^a(s, s') R^a(s, s')$.

   Solving $V^\pi$ for a given policy π is called policy evaluation (simple: power iteration).
   Solving $V^{\pi^*}$ or $\pi^*$ is called control, usually via value iteration:

   $$V_{k+1}(s) = \max_a E[r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a]
                = \max_a \sum_{s'} P^a(s'|s)\big(R^a(s, s') + \gamma V_k(s')\big).$$

   Policy iteration is similar.
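   A minimal tabular value-iteration sketch of the update above; the dictionary-based
   model layout (P[s][a], R[s][a][s']) is an assumption for illustration, not code
   from the thesis.

    # Tabular value iteration:
    # V_{k+1}(s) = max_a sum_{s'} P^a(s'|s) * (R^a(s, s') + gamma * V_k(s')).
    def value_iteration(P, R, gamma=0.9, n_iters=100, tol=1e-8):
        """P[s][a] is a list of (s', prob); R[s][a][s'] is the reward. Returns V."""
        V = {s: 0.0 for s in P}
        for _ in range(n_iters):
            delta = 0.0
            for s, actions in P.items():
                backups = [
                    sum(prob * (R[s][a][sp] + gamma * V.get(sp, 0.0))
                        for sp, prob in outcomes)
                    for a, outcomes in actions.items()
                ]
                new_v = max(backups) if backups else 0.0
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:      # stop once the values have converged
                break
        return V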

RL
   Features of RL:
        • Sample-based learning. No model.
        • Only immediate rewards are observed.
        • Partially observable, e.g., citation count prediction.
   Prediction/Control: solving V^π (for some π) or V^* using samples.
   Sample efficiency and memory are important. Algorithms:
        • TD, Q-learning [Barto et al. 83; Sutton 88; Dayan 92; Bertsekas 96]
        • Dyna [Sutton et al. 91] and linear Dyna [Sutton et al. 08]
        • LSTD [Boyan 02], LSPI [Lagoudakis and Parr 03]
        • GTD [Sutton et al. 09; Maei et al. 10]


Prediction
   Feature mapping: φ : S → R^n, n being the number of features.
   Linear Function Approximation (LFA). We approximate V^π by

   $$\hat{V}^\pi(s) = \phi(s)^\top \theta,$$

   for s ∈ S, where θ is the parameter vector (to learn).
   Samples (this is our Big Data):

   $$D = \big( \langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle \big)_{t=1,2,\ldots,T}$$

   Prediction: Given an input policy π, output an estimate of the
   value function V^π. We learn a predictor on D using φ.
   On-policy: D is created by following π.
   Off-policy: D is not created by π.

Prediction-cont’-TD (Sutton, 88)
   Given the tuple ⟨φ(s_t), a_t, r_{t+1}, φ(s_{t+1})⟩, Temporal Difference
   (TD) learning (without eligibility traces):

   $$\delta_t = r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t).$$

   δ_t is called the TD error, which is a sample of the Bellman residual:

   $$E[\delta_t \mid s_t = s] = \bar{r}(s) + \gamma \sum_{s' \in S} P^\pi(s, s') \hat{V}^\pi(s') - \hat{V}^\pi(s).$$

   $$\Delta\theta_t \propto \alpha_t \delta_t \phi(s_t)$$
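   A minimal sketch of linear TD(0) implementing exactly these updates; the data
   layout (an iterable of (s, r, s') transitions) and the step size are assumptions
   made for illustration.

    import numpy as np

    def td0_linear(transitions, phi, n_features, alpha=0.01, gamma=0.99):
        """Linear TD(0): theta <- theta + alpha * delta * phi(s_t).

        `transitions` is an iterable of (s, r, s_next) tuples generated by
        following the policy being evaluated; `phi` maps a state to a vector
        of length n_features.
        """
        theta = np.zeros(n_features)
        for s, r, s_next in transitions:
            v_s = phi(s) @ theta
            v_next = phi(s_next) @ theta
            delta = r + gamma * v_next - v_s        # TD error
            theta += alpha * delta * phi(s)         # TD(0) update
        return theta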



Preconditioning Framework [ICML 08]

   Previously: issues of step-size, sample efficiency, and sparsity in
   LSTD [Boyan 02], LSPE [Bertsekas et al. 96, 03, 04], FPKF
   [Van Roy 06], and iLSTD [Geramifard et al. 06, 07].
   Contribution: I proposed a general class of algorithms by
   applying the preconditioning technique from iterative analysis;
   the class includes all of the algorithms above, and the framework
   addresses all of these issues. Empirical results: the step-size
   adaptation learns much more quickly, and sparsity-based storage and
   computation are more efficient.




Multi-step Linear Dyna [NIPS 09]
   Previously: Online planning is believed to speed up learning
   and to lead to better decisions (mostly in tabular settings), but "model-based
   is poorer than model-free". Dyna [Sutton et al. 92] / linear Dyna [Sutton
   et al. 08] is an integrated architecture for real-time acting, learning, modeling, and
   planning, with none of these waiting for the others to complete. However, linear Dyna
   was found to perform worse than (model-free) TD learning [Sutton et al. 08].
   Contribution: I improved linear Dyna [Sutton et al. 08] to
   perform much better than TD. I also extended linear Dyna from
   single-step to multi-step planning, and demonstrated on
   Mountain-car (an under-powered car climbing a mountain) that
   multi-step planning predicts more accurately than single-step planning.



LAM based off-policy learning
[ICML-workshop 10]
   Previously: Off-policy learning is ubiquitous. TD can diverge, but is
   reasonably fast when it converges. GTD algorithms [Sutton et al.
   09, 10, 11] converge but are slow.
   Contribution: I used a linear action model (LAM) based framework for
   off-policy learning. It can learn about various policies in parallel from
   a single stream of data, enabling quick real-time decision making.
   Evaluated on two continuous-state, hard control problems. I
   recommend using LAM based off-policy learning in place of
   on-policy learning.


Deterministic MDPs [with Csaba]
   Previously. Theory: state aggregation, LFA; Practice: LFA and
   neural networks.
   Contribution: A very general framework for RL with function
   approximation. We propose to view all RL methods as building
   some correspondence MDP, which has a smaller state space
   than the original. We solve the correspondence MDP and lift
   the policies and value functions found there back to the original MDP.
   A number of important results are proved (20+ lemmas and
   theorems). This framework is helpful for understanding existing
   algorithms as well as for developing new ones.



Reinforcement Ranking [with Dale]

   Does the Bellman equation look familiar? PageRank? Stationary
   distributions?
   Contribution: We proposed a framework for discovering
   authorities using rewards defined on pages/links. There is no stationary
   distribution at all, yet convergence is still guaranteed. Evaluation was
   performed on Wikipedia, DBLP, and WebSpamUK, comparing
   precision and recall with PageRank and TrustRank.
   Promising results on Wikipedia and DBLP.




Universal Option Models [with R.C.S]
   Previously: Options are used to describe high-level decisions. The execution of an
   option is a sequence of actions (abstraction). Traditional option models
   consist of a reward part and a state-prediction part. This is very
   inefficient with multiple reward functions (or when the reward function
   changes dynamically).
   Contribution: We proposed a new way of modeling
   options: removing the reward part but adding a state-occupancy
   part. We proved that (a) given any reward function,
   we can construct the return of the option from the new model;
   (b) with the new model we can recover the TD solution without
   any planning computation. On a simulated StarCraft 2 game, planning with the
   new model is much more efficient than with the traditional model.
   Very suitable for large real-time games.

LAM-API [AAAI 12]
   Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03])
   have to store the sample set D and sweep over all the samples in each iteration; D
   can be very large in practice.
   Key idea: Summarize your Big Data with a model. Work with the model.
   First, we learn a linear model ⟨F^a, f^a⟩ for each action a from the samples. For a
   given action a and any state s ∈ S, with s' ∼ P^a(s, ·), we expect that

   $$F^a \phi(s) \approx E[\phi(s')] \quad \text{and} \quad (f^a)^\top \phi(s) \approx E[R^a(s, s')].$$

   Complexity of modeling: O(T n²).
   Second, we use all the LAMs to perform API. Complexity: O(|L| n² N_iter).
   LAM-API: O(T n²) + O(|L| n² N_iter);   LSPI: O(T n² N_iter).
   Big Data: T ≫ |L|, which means
        O(T n²) (LAM-API) ≪ O(T n² N_iter) (LSPI).
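   One plausible way to fit the per-action models ⟨F^a, f^a⟩ from the sample set D is
   ridge-regularized least squares; the sketch below is my own illustration of that
   idea, not necessarily the estimator used in the paper.

    import numpy as np

    def fit_lams(samples, n_features, reg=1e-6):
        """Fit a linear action model <F^a, f^a> per action from samples.

        `samples` is an iterable of (phi_s, a, r, phi_next), with phi_* as
        vectors of length n_features. Returns dicts F[a] (n x n) and f[a] (n,)
        such that F[a] @ phi(s) ~ E[phi(s')] and f[a] @ phi(s) ~ E[R^a(s, s')].
        """
        by_action = {}
        for phi_s, a, r, phi_next in samples:
            by_action.setdefault(a, []).append((phi_s, r, phi_next))

        F, f = {}, {}
        for a, rows in by_action.items():
            X = np.array([p for p, _, _ in rows])        # T_a x n current features
            Y = np.array([pn for _, _, pn in rows])      # T_a x n next features
            R = np.array([r for _, r, _ in rows])        # T_a rewards
            G = X.T @ X + reg * np.eye(n_features)       # regularized Gram matrix
            F[a] = np.linalg.solve(G, X.T @ Y).T         # so that F[a] @ phi ~ phi'
            f[a] = np.linalg.solve(G, X.T @ R)
        return F, f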


LAM-API-cont’
   Algorithm 1: LAM-API with LSTD
   Input: a list of features L = {φ_i}, a LAM (⟨F^a, f^a⟩)_{a∈A}. Output: a weight vector θ.
   Initialize θ
   repeat for N_iter iterations:
       for φ_i in L do
           Select the greedy action:
               a* = argmax_a { (f^a)^⊤ φ_i + γ θ^⊤ F^a φ_i }
           Select the model:
               F* = F^{a*},   f* = f^{a*}
           Produce predictions of the next feature vector and reward:
               φ̃_{i+1} = F* φ_i
               r̃_i = (f*)^⊤ φ_i
           Accumulate the LSTD structures:
               A = A + φ_i (γ φ̃_{i+1} − φ_i)^⊤
               b = b + φ_i r̃_i
       θ = −A^{−1} b
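   A direct Python transcription of Algorithm 1, kept as a sketch: resetting A and b
   at every iteration and adding a small ridge term for invertibility are my own
   choices, not specified on the slide.

    import numpy as np

    def lam_api_lstd(L, F, f, gamma=0.99, n_iters=20, reg=1e-6):
        """Approximate policy iteration with a linear action model (Algorithm 1).

        L: list of feature vectors phi_i (each of shape (n,));
        F, f: dicts mapping each action a to F^a (n x n) and f^a (n,).
        Returns the weight vector theta of the greedy policy's value function.
        """
        n = len(L[0])
        theta = np.zeros(n)
        actions = list(F.keys())
        for _ in range(n_iters):
            A = reg * np.eye(n)                   # small ridge term for invertibility
            b = np.zeros(n)
            for phi in L:
                # Greedy action under the current theta, evaluated with the model.
                a_star = max(actions,
                             key=lambda a: f[a] @ phi + gamma * theta @ (F[a] @ phi))
                phi_next = F[a_star] @ phi        # predicted next feature vector
                r_hat = f[a_star] @ phi           # predicted reward
                A += np.outer(phi, gamma * phi_next - phi)
                b += phi * r_hat
            theta = -np.linalg.solve(A, b)        # theta = -A^{-1} b
        return theta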


LAM-API-cont’
   Compared learning quality with LSPI. L = {φ_i | φ_i ← D_i}.
   Chain-walk problems. Left: 4-state chain. Right: LAM-LSTD on the 50-state chain.
   LAM-LSTD converges in 4 iterations. At iteration 2, the policy is already optimal.

   [Figure: LAM-LSTD on the 50-state chain: value functions over states 1-50 at
   iterations 0 (all actions 'R'), 1, and 2, each labeled with the corresponding
   L/R policy; the iteration-2 policy is already optimal.]

LAM-API-cont’
   LSPI converges in 14 iterations; it finds the optimal policy only at the
   last iteration.

   [Figure: (c) LSPI iterations 0, 1, 7 and (d) LSPI iterations 8, 9, 14 on the
   50-state chain: value functions over states 1-50, each labeled with the
   corresponding L/R policy; iteration 0 takes all actions 'R', and the policy
   becomes optimal only at iteration 14.]

LAM-API-cont’-Cart-Pole
   Goal: keep the pendulum above horizontal (for a maximum of 3000 steps).
   Reward: binary. State: angle and angular velocity (both continuous).

   [Figure: (e) the cart-pole balancing task; (f) number of balanced steps vs. number
   of training episodes (0-1000), showing the best, average, and worst runs of
   LAM-LSTD and LSPI.]

   Why important? LSPI [Lagoudakis and Parr 03] is widely used; "LSPI is arguably the
   most competitive RL algorithm available in large environments." [Li et al. 09]

Citation Count Prediction
   Citation count: the most widely used measure in academia.
   Predicting it is interesting. We studied the prediction of
   citation counts for papers.
   Previously, [Yan et al. 11] and [Fu 08] studied citation count
   prediction using supervised learning (SL).
   Training (spatial):

                                Input → Output
           x: feature vectors in 1990 → y: citation counts until 2000

   Given a paper's features in 1990: predict.
   Now, given a paper's features in 2000: what then? (a temporal aspect)


Citation Count Prediction–cont’
   Citation count prediction is temporal.
   Problem formulation. Define the "value" of a paper p at year t as
   the sum of the discounted numbers of citations in all subsequent years:

   $$V(p, t) = \sum_{q=t}^{\infty} \gamma^{q-t} c_q, \qquad \gamma \in [0, 1),$$

   where c_q is the number of citations the paper receives in year q.
   When t is the publication year of the paper and γ approaches one, V(p, t) is
   virtually the total citation count of the paper.
   Question: What is my state here?  s_t = (p, t)
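   A small sketch that computes this discounted citation value from a finite citation
   history, truncating the infinite sum at the last observed year; the example counts
   are made up.

    def citation_value(citations_per_year, start_year, gamma=0.95):
        """V(p, t) = sum_{q >= t} gamma^(q - t) * c_q, truncated at the observed horizon.

        `citations_per_year` maps a year q to the citation count c_q of paper p.
        """
        return sum(gamma ** (q - start_year) * c
                   for q, c in citations_per_year.items() if q >= start_year)

    # Example (made-up counts): a paper cited 3, 5, and 8 times in 2000-2002.
    print(citation_value({2000: 3, 2001: 5, 2002: 8}, start_year=2000, gamma=0.95))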

Citation Count Prediction–cont’
   We represent the value by

   $$\hat{V}(p, t) = \phi(p, t)^\top \theta.$$

   Samples: a data set

   $$D = \bigcup_{p \in P} D_p; \qquad D_p = \big( \langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle \big)_{t=1990, 1991, \ldots, 2000}$$

   Features. φ(p, t) is a vector with entries for, e.g., the number
   of citations of each author up to year t, the number of citations of
   the venue up to year t, etc.

Citation Count Prediction–cont’
   Long-term: predicting more than 10 years ahead (TD and LSTD).
   Short-term: predicting within k (k < 10) years:
        • not a standard RL problem
        • We extended LAM to this context and proposed a model-based
          prediction method.

   Key idea: learn a model ⟨F, f⟩ from the year-to-year status changes of
   papers. Given ⟨φ(p, t), c_{t+1}, φ(p, t+1)⟩, update

   $$\Delta F = \alpha \big[ \phi(p, t+1) - F \phi(p, t) \big] \phi(p, t)^\top,$$

   and

   $$\Delta f = \alpha \big[ c_{t+1} - f^\top \phi(p, t) \big] \phi(p, t).$$

Citation Count Prediction–cont’
   What have we learned?
   f: a one-year predictor
   F: multiple one-year predictors

   [Figure: the linear transient model F maps a paper's feature vector in year 2000
   (#citations of the first author, #citations of the last author, the paper's own
   #citations in the last few years, #citations of the citing papers, ...) to the
   corresponding feature vector in year 2001.]


Citation Count Prediction–cont’
   How do we use the model?
   Given: the feature vector of a paper s in year t = 2012, φ(s_2012).
   Citation count in 2013: ĉ_1 = f^⊤ φ(s_2012).
   Citation count in 2014: we need φ(s_2013), which is unavailable. We can
   predict the features: φ̂_2013 := F φ(s_2012). Then we combine with f again to
   predict

   $$\hat{c}_2 = f^\top \hat{\phi}_{2013}.$$

   Using a prediction to predict.
   This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and
   LAM-API to more general multi-step prediction problems.
   Similarly, we can extrapolate further years into the future.
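   A sketch of this multi-step use of the model: each additional year applies F once
   more to the (predicted) feature vector, and f maps features to a one-year
   citation-count prediction; the function name and data layout are my own.

    import numpy as np

    def predict_future_citations(phi_now, F, f, n_years):
        """Return [c_hat_1, ..., c_hat_k]: predicted per-year citation counts.

        c_hat_1 = f . phi_now, c_hat_2 = f . (F phi_now), and so on, i.e.
        c_hat_k = f . F^{k-1} phi_now ("using a prediction to predict").
        """
        preds, phi = [], np.asarray(phi_now, dtype=float)
        for _ in range(n_years):
            preds.append(float(f @ phi))   # one-year prediction from current features
            phi = F @ phi                  # roll the features one year forward
        return preds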



Citation Count Prediction–Empirical
   "Now" is 2002.
   Training data: the citation counts of 7K papers from 1990 to "Now".
   Test data: their citation counts from "Now" to 2012.
   Algorithms: LS/SVR, LSTD.

   [Figure: (g) LS on the training data and (h) LS predicting the future;
   both plots show predicted vs. true citation values.]

Citation Count Prediction–Empirical
   LSTD successfully generalizes over time for the training papers.

   [Figure: (i) LSTD predictions vs. true values for the training papers.]

Citation Count Prediction–Empirical
   Predicting for test papers (newer than the training papers).
   Papers are marked according to their year of publication (black, green, blue, red, magenta):
        • in the left plot, papers published in 1990, 1991, 1992, 1993, 1994
        • in the middle plot, papers published in 1995, 1996, 1997, 1998, 1999
        • in the right plot, black and green correspond to papers published in 2000 and 2001
   True citation counts are marked with a cross (+); predictions with a star (*).
   LSTD successfully generalizes over time for new papers.

   [Figure: LSTD(0) predictions vs. true values for (j) papers from 1990-94,
   (k) papers from 1995-99, and (l) papers from 2000-01.]

Citation Count Prediction–Empirical
   Short-term prediction: the performance of the proposed method.

   [Figure: (m) Dyna-style 4-year prediction vs. true values for papers from 2000-01;
   (n) summary: RMSE vs. the number of years into the future (1-8) for papers
   published before 1990, in 1990-1995, in 1995-2000, and in 2000-2001.]

                                 Thank You!



                                     Hengshuai Yao     Thesis Overview   33/33

More Related Content

Similar to Hengshuai noah

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
A brief introduction to 'R' statistical package
A brief introduction to 'R' statistical packageA brief introduction to 'R' statistical package
A brief introduction to 'R' statistical package
Shanmukha S. Potti
 
Formal methods 5 - Pi calculus
Formal methods   5 - Pi calculusFormal methods   5 - Pi calculus
Formal methods 5 - Pi calculus
Vlad Patryshev
 
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemAlgorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
IRJET Journal
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
Illia Polosukhin
 
Query compiler
Query compilerQuery compiler
Query compiler
Digvijay Singh
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
DataWorks Summit
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
DataWorks Summit
 
Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.
Ian Sa
 
LFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdf
harinsrikanth
 
Personalised Search for the Social Semantic Web
Personalised Search for the Social Semantic WebPersonalised Search for the Social Semantic Web
Personalised Search for the Social Semantic Web
Oana Tifrea-Marciuska
 
relational algebra and it's implementation
relational algebra and it's implementationrelational algebra and it's implementation
relational algebra and it's implementation
dbmscse61
 

Similar to Hengshuai noah (12)

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
A brief introduction to 'R' statistical package
A brief introduction to 'R' statistical packageA brief introduction to 'R' statistical package
A brief introduction to 'R' statistical package
 
Formal methods 5 - Pi calculus
Formal methods   5 - Pi calculusFormal methods   5 - Pi calculus
Formal methods 5 - Pi calculus
 
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemAlgorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
Query compiler
Query compilerQuery compiler
Query compiler
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
 
Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.
 
LFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdf
 
Personalised Search for the Social Semantic Web
Personalised Search for the Social Semantic WebPersonalised Search for the Social Semantic Web
Personalised Search for the Social Semantic Web
 
relational algebra and it's implementation
relational algebra and it's implementationrelational algebra and it's implementation
relational algebra and it's implementation
 

Hengshuai noah

  • 1. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL My RL Approach to Prediction and Control Hengshuai Yao University of Alberta April 4, 2013 Hengshuai Yao Thesis Overview 1/33
  • 2. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Outline • One-page Summary of my work (30 seconds) • Background on Reinforcement Learning (RL) (8 slides; 6 minutes) • Walkthrough my work (6 slides, 4 minutes) • LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes) • Citation count prediction using RL (10 slides; 8 minutes) Hengshuai Yao Thesis Overview 2/33
  • 3. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Summary of my work Prediction • A framework for existing prediction algorithms [ICML 08] • Data efficiency for on-policy prediction (Multi-step linear Dyna [NIPS 09]) and off-policy prediction (LAM-off-policy [ICML-workshop 10]) Control • Memory and sample efficiency for control (LAM-API [AAAI 12]) • Online abstract planning with Universal Option Models [in preparation for JAIR with Csaba, Rich and Shalabh] • RL with general function approximation. Deterministic MDPs [in preparation for Machine Learning Journal with Csaba] RL applications • Citation count prediction[submitted to IEEE Trans. on SMC-part B] • Ranking [current work with Dale] Hengshuai Yao Thesis Overview 3/33
  • 4. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Background on RL I will go over • MDPs: Definition, Policies, Value Functions, and more • Prediction Problems (TD, Dyna, On-policy, Off-policy) • The Control Problem (Value Iteration, Q-learning, LSPI) Hengshuai Yao Thesis Overview 4/33
  • 5. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL MDPs An MDP is defined by a 5-tuple, γ, S, A, (P a )a∈A , (Ra )a∈A . P a (s′ |s) = P0 (s′ |s, a) Ra : S × S → R (P a )a∈A and (Ra )a∈A are called the model or transition-dynamics. A policy, π : S × A → [0, 1], selects actions at states. Think about a policy as a way of how you act everyday. Hengshuai Yao Thesis Overview 5/33
  • 6. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL My MDP example t=0 policy π University of Alberta 1.0 S: UofA, EDM, HK, Paomadi, t=1,r=$-100 Noah Edmonton Airport A: the set of the links 1.0 t=3, r_{horse} (P a )a∈A : deterministic 0.1 Ra (s, s′ ) = r(s′ ), HK Airport π(U A, Edmonton) = 1, 1.0 t=2,r=$-1,000 0.9 π(HK, N oah) = 0.9, π(HK, P aomadi) = 0.1; etc. t=3, r=$10,000 Hengshuai Yao Thesis Overview 6/33
  • 7. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Value function ∞ V π (s) = E γ t rt+1 | s0 = s, at ∼ π(st , ·) t=0 Optimal policy ∗ V ∗ (s) = V π (s) = max V π (s), for all s ∈ S. π V π (U of A) = −100 − 1000γ + γ 2 (0.1 × (−1000) + 0.9 × 10, 000) If rhorse = −1, 000: V ∗ (U of A) = −100 − 1000γ + γ 2 ( 1.0 ×10, 000) =π ∗ (HK,N oah) rhorse = 1, 000, 000: V ∗ (U of A) = −100 − 1000γ + γ 2 ( 1.0 ×1, 000, 000) =π ∗ (HK,P aomadi) Hengshuai Yao Thesis Overview 7/33
  • 8. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL MDPs cont’–Dynamic programming Bellman equation. For all s ∈ S, for any policy π, one-step look-ahead: V π (s) = r(s) + γES ′ ∼π(s,·) [V π (S ′ )], ¯ where r(s) = s′ P π (s, s′ )r(s, s′ ); r(s, s′ ) = a∈A P a (s, s′ )Ra (s, s′ ). ¯ Solving V π for an ordinary policy π is called policy evaluation. Simple, power iteration. ∗ Solving V π or π ∗ is called control, usually using value iteration: Vk+1 (s) = max E[rt+1 + γVk (st+1 ) | st = s, at = a] a = max P a (s′ |s)(Ra (s, s′ ) + γVk (s′ )) a s′ Policy iteration is similar. Hengshuai Yao Thesis Overview 8/33
  • 9. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL RL Features of RL: • Sample-based learning. No model. • Only intermediate rewards are observed. • Partially observable, e.g., citation count prediction. Prediction/Control: solving V π (for some π) or V ∗ using samples. Sample efficiency and memory is important. Algorithms: • TD, Q-learning [Barto et. al. 83; Sutton 88; Dayan 92; Bertsekas 96] • Dyna [Sutton et. al. 91] and linear Dyna [Sutton el.al.08]. • LSTD [Boyan 02], LSPI [Lagoudakis and Parr, 03] • GTD [Sutton et. al. 09; Maie 10 el. al. 10] Hengshuai Yao Thesis Overview 9/33
  • 10. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Prediction Feature mapping: φ : S → Rn , n being the number of features. Linear Function Approximation (LFA). We approximate V π by ˆ V π (s) = φ(s)⊤ θ, for s ∈ S, where θ is the parameter vector (to learn). Samples (this is our Big Data) D = ( φ(st ), at , rt+1 , φ(st+1 ) )t=1,2,...,T Prediction: Given an input policy π, output an estimate of the value function V π . We learn a predictor on D using φ. On-policy: D is created by following π. Off-policy: D is not created by π. Hengshuai Yao Thesis Overview 10/33
  • 11. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Prediction-cont’-TD (Sutton, 88) Given the tuple φ(st ), at , rt+1 , φ(st+1 ) , Temporal Difference (TD) learning (without eligibility trace): ˆ ˆ δt = rt+1 + γ V (st+1 ) − V (st ), δt is called the TD error, which is a sample of the Bellman residual: E[δt |st = s] = r (s) + γ ¯ ˆ ˆ P π (s, s′ )V π (s′ ) − V π (s). s′ ∈S ∆θt ∝ αt δt φ(st ) Hengshuai Yao Thesis Overview 11/33
  • 12. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Preconditioning Framework [ICML 08] Previously: Issues of step-size, sample efficiency, and sparsity: LSTD [Boyan 02], LSPE [Bertsekas et. al. 96, 03, 04], FPKF [Van Roy 06], iLSTD [Geramifard et. al. 06, 07]. Contribution: I proposed a general class of algorithms by applying the preconditioning technique in iterative analysis, which includes the above mentioned algorithms. I solved all these issues in this framework. Empirical results: the step-size adaptation learns much quicker; sparsity based storage and computation is more efficient. Hengshuai Yao Thesis Overview 12/33
  • 13. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Multi-step Linear Dyna [NIPS 09] Previously: Online planning is believed to speed up learning and makes better decisions (mostly tabular), but “Model-based is poorer than model-free”. Dyna [Sutton et. al. 92]/linear Dyna [Sutton et.al.08] is an integrated architecture for real time acting, learning, modeling, and planning without waiting for each other to complete. However, linear Dyna was found to perform inferior to (model-free) TD learning [Sutton et.al.08]. Contribution: I improved linear Dyna [Sutton et.al. 08] to perform much better than TD. I also extended linear Dyna from single-step to multi-step planning, and demonstrated on Mountain-car (an under-powered car climbing a mountain) that multi-step planning predicts more accurately than single-step. Hengshuai Yao Thesis Overview 13/33
  • 14. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL LAM based off-policy learning [ICML-workshop 10] Previously: Off-policy learning is ubiquitous. TD diverges but is reasonably fast if it converges. GTD algorithms [Sutton et. al. 09,10,11] converge but are slow. Contribution: I used linear action model based framework for off-policy learning. It can learn various policies in parallel from a single stream of data, for quick real time decision making. Evaluated on two continuous-state, hard control problems. I recommend using LAM based off-policy learning in place of on-policy learning. Hengshuai Yao Thesis Overview 14/33
  • 15. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Deterministic MDPs[with Csaba] Previously. Theory: state aggregation, LFA ; Practice: LFA, and neural network Contribution: A very general framework for RL with function approximation. We propose to view all RL methods as building some correspondence MDP, which has a smaller state space than the original. We solve the correspondence MDP and lift the policies and value functions found there back to the original. A few important results are proved (20+ lemmas and theorems). This framework is helpful in understanding existing algorithms as well as developing new ones. Hengshuai Yao Thesis Overview 15/33
• 16. Reinforcement Ranking [with Dale]
Does the Bellman equation look familiar? PageRank? Stationary distribution?
Contribution: we proposed a framework for discovering authorities using rewards defined on pages/links. No stationary distribution is involved, yet convergence is still guaranteed. Evaluated on Wikipedia, DBLP, and WebSpamUK; precision and recall compared against PageRank and TrustRank, with promising results on Wikipedia and DBLP.
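A rough Python sketch of the kind of reward-driven, Bellman-style scoring this suggests. Everything here is my own reconstruction under assumptions: a page-reward vector r, a link-weight matrix W, discounting gamma < 1 so the iteration is a contraction, and score flowing from linking pages to linked pages; the actual Reinforcement Ranking formulation may differ in the direction and normalization:

import numpy as np

def reward_rank(W, r, gamma=0.9, iters=100):
    """Iterate v <- r + gamma * W^T v to a fixed point.
    W[i, j]: weight of the link from page i to page j (no row-stochasticity,
    hence no stationary distribution); r[i]: reward assigned to page i."""
    v = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        v = r + gamma * W.T.dot(v)
    return v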
• 17. Universal Option Models [with R.C.S.]
Previously: options describe high-level decisions; executing an option yields a sequence of actions (temporal abstraction). Traditional option models consist of a reward part and a state-prediction part, which is very inefficient when there are multiple reward functions (or the reward function changes dynamically).
Contribution: we proposed a new way of modeling options: removing the reward part and adding a state-occupancy part. We proved that (a) given any reward function, the return of the option can be constructed from the new model, and (b) with the new model the TD solution can be recovered without any planning computation. On a simulated StarCraft 2 game the new model is much more efficient for planning than the traditional one, making it well suited to large real-time games.
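In rough terms (my paraphrase, with linear features and rewards assumed linear in the features, $r(s) \approx w^\top \phi(s)$): a universal option model keeps a state-prediction part $M^o$ and a discounted state-occupancy part $U^o$, from which the option's reward model for any reward weight vector $w$ can be read off without re-planning:
$$U^o \phi(s) \approx \mathbb{E}\Big[\textstyle\sum_{t=0}^{\tau-1} \gamma^t \phi(s_t) \,\Big|\, s_0 = s,\ \text{follow } o\Big], \qquad
M^o \phi(s) \approx \mathbb{E}\big[\gamma^{\tau} \phi(s_\tau) \,\big|\, s_0 = s\big], \qquad
R^o_w(s) \approx w^\top U^o \phi(s),$$
where $\tau$ is the (random) termination time of the option.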
• 18. LAM-API [AAAI 12]
Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03]) have to store the sample set D and sweep over all the samples in every iteration; D can be very large in practice.
Key idea: summarize your big data with a model, then work with the model.
First, learn a linear model $(F^a, f^a)$ for each action $a$ from the samples: for any state $s \in S$ with $s' \sim P^a(s, \cdot)$, we want $F^a \phi(s) \approx \mathbb{E}[\phi(s')]$ and $(f^a)^\top \phi(s) \approx \mathbb{E}[R^a(s, s')]$. Modeling cost: $O(Tn^2)$.
Second, use all the LAMs to perform API over a feature list $L$. Cost: $O(|L|\, n^2 N_{\text{iter}})$.
Totals: LAM-API $O(Tn^2) + O(|L|\, n^2 N_{\text{iter}})$ vs. LSPI $O(T n^2 N_{\text{iter}})$.
Big data means $T \gg |L|$, so LAM-API's $O(Tn^2)$ is far smaller than LSPI's $O(Tn^2 N_{\text{iter}})$.
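A hypothetical instance to make the comparison concrete (the numbers are illustrative assumptions, not from the experiments): with $T = 10^6$ samples, $|L| = 10^3$ features, $n = 100$, and $N_{\text{iter}} = 20$,
$$\text{LAM-API: } T n^2 + |L|\, n^2 N_{\text{iter}} = 10^{10} + 2\times 10^{8}, \qquad \text{LSPI: } T n^2 N_{\text{iter}} = 2\times 10^{11},$$
roughly an $N_{\text{iter}}$-fold (about 20x) saving, dominated by the single modeling pass over the data.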
• 19. LAM-API, cont'd
Algorithm 1: LAM-API with LSTD
  Input: a list of features L = {φ_i}, a LAM (F^a, f^a)_{a∈A}.
  Output: a weight vector θ.
  Initialize θ.
  Repeat for N_iter iterations:
    for each φ_i in L:
      Select the greedy action:  a* = argmax_a { (f^a)^⊤ φ_i + γ θ^⊤ F^a φ_i }
      Select the model:  F* = F^{a*},  f* = f^{a*}
      Predict the next feature vector and reward:  φ̃_{i+1} = F* φ_i,  r̃_i = (f*)^⊤ φ_i
      Accumulate the LSTD structures:  A ← A + φ_i (γ φ̃_{i+1} − φ_i)^⊤,  b ← b + φ_i r̃_i
    θ = −A^{-1} b
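A compact Python (NumPy) rendering of Algorithm 1 as I read it; resetting the LSTD accumulators at each policy-evaluation sweep and the small ridge term for invertibility are my assumptions, not part of the slide:

import numpy as np

def lam_api_lstd(features, F, f, gamma=0.99, n_iter=20, reg=1e-6):
    """features: list of feature vectors phi_i; F[a], f[a]: linear model of action a."""
    n = features[0].shape[0]
    theta = np.zeros(n)
    actions = list(F.keys())
    for _ in range(n_iter):
        A = reg * np.eye(n)   # reset accumulators each sweep (assumption); ridge for stability
        b = np.zeros(n)
        for phi in features:
            # greedy action under the current value estimate
            a_star = max(actions,
                         key=lambda a: f[a].dot(phi) + gamma * theta.dot(F[a].dot(phi)))
            phi_next = F[a_star].dot(phi)   # model-predicted next feature vector
            r = f[a_star].dot(phi)          # model-predicted reward
            A += np.outer(phi, gamma * phi_next - phi)
            b += phi * r
        theta = -np.linalg.solve(A, b)      # theta = -A^{-1} b, as in Algorithm 1
    return theta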
• 20. LAM-API, cont'd
Compared learning quality with LSPI. L = {φ_i | φ_i ← D_i}. Chain-walk problems. Left: the 4-state chain. Right: LAM-LSTD on the 50-state chain.
LAM-LSTD converges in 4 iterations; at iteration 2 the policy is already optimal.
[Figure: value functions on the 50-state chain.
 Iteration 0: all actions are 'R'.
 Iteration 1: RRRRRRRRRRRLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLL
 Iteration 2: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
• 21. LAM-API, cont'd
LSPI converges in 14 iterations and finds the optimal policy only at the last iteration.
[Figure: LSPI value functions and policies on the 50-state chain. (c) iterations 0, 1, 7; (d) iterations 8, 9, 14.
 Iteration 0: all actions are 'R'.
 Iteration 1: LLLLLRRRRRRLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLR
 Iteration 7: LRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLRR
 Iteration 8: RRRRLLRRRRRRLLLLLLLRRRLLLLLRRRRRRLLLLRRRRRRRLLLLLL
 Iteration 9: LRRRRRRRRRRLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLLLR
 Iteration 14: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
• 22. LAM-API, cont'd: Cart-Pole
Goal: keep the pendulum above horizontal (for a maximum of 3000 steps). Reward: binary. State: angle and angular velocity (both continuous).
[Figure: (e) cart-pole balancing; (f) number of balanced steps vs. number of training episodes, with curves for LSPI (average, worst), LAM-LSTD (average, worst), and the best of LAM-LSTD/LSPI.]
Why is this important? LSPI [Lagoudakis and Parr 03] is widely used; "LSPI is arguably the most competitive RL algorithm available in large environments." [Li et al. 09]
• 23. Citation Count Prediction
Citation count is the most widely used measure in academia; predicting it is interesting. We studied predicting the citation counts of papers.
Previously, [Yan et al. 11] and [Fu 08] studied citation count prediction using supervised learning (SL).
Training (spatial): input x = feature vectors in 1990 → output y = citation counts until 2000.
Given a paper's features in 1990, we can predict. But given a paper's features in 2000, what then? (a temporal aspect)
• 24. Citation Count Prediction, cont'd
Citation count prediction is temporal.
Problem formulation: define the "value" of a paper p at year t as the discounted sum of its citation counts over all subsequent years,
$$V(p, t) = \sum_{q=t}^{\infty} \gamma^{\,q-t}\, c_q, \qquad \gamma \in [0, 1),$$
where c_q is the number of citations the paper receives in year q. When t is the publication year and γ approaches one, V(p, t) is approximately the paper's total citation count.
Question: what is the state here? s_t = (p, t).
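A small Python check of the definition; the yearly counts below are made-up numbers:

def citation_value(counts, gamma=0.95):
    """V(p, t) = sum_{q >= t} gamma^(q - t) * c_q, with counts = [c_t, c_{t+1}, ...]."""
    return sum(gamma ** i * c for i, c in enumerate(counts))

# e.g., a paper cited 2, 5, 8, 6, 4 times in its first five years:
print(citation_value([2, 5, 8, 6, 4]))                # discounted value from year t
print(citation_value([2, 5, 8, 6, 4], gamma=0.999))   # close to the raw total (25) as gamma -> 1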
• 25. Citation Count Prediction, cont'd
We use a linear representation, $\hat V(p, t) = \phi(p, t)^\top \theta$.
Samples: a data set $D = \bigcup_{p \in P} D_p$, where $D_p = \big(\langle \phi(p,t),\, c_{t+1},\, \phi(p,t+1) \rangle\big)_{t = 1990, 1991, \ldots, 2000}$.
Features: φ(p, t) is a vector with entries for, e.g., the number of citations of each author up to year t, the number of citations of the venue up to year t, etc.
• 26. Citation Count Prediction, cont'd
Long-term prediction (more than 10 years ahead): TD and LSTD.
Short-term prediction (k < 10 years ahead):
 • not a standard RL problem;
 • we extended LAM to this context and proposed a model-based prediction method.
Key idea: learn a model (F, f) from the year-to-year status changes of papers. Given $\langle \phi(p,t),\, c_{t+1},\, \phi(p,t+1) \rangle$, update
$$\Delta F = \alpha\,\big[\phi(p,t+1) - F\phi(p,t)\big]\,\phi(p,t)^\top, \qquad \Delta f = \alpha\,\big[c_{t+1} - f^\top \phi(p,t)\big]\,\phi(p,t).$$
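The same updates written as a short Python sketch; the feature construction and the step size are assumptions for illustration:

import numpy as np

def update_citation_model(F, f, phi_t, c_next, phi_next, alpha=0.01):
    """One stochastic-gradient step on the year-to-year model.
    F: feature-transition matrix; f: one-year citation predictor."""
    F += alpha * np.outer(phi_next - F.dot(phi_t), phi_t)
    f += alpha * (c_next - f.dot(phi_t)) * phi_t
    return F, f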
• 27. Citation Count Prediction, cont'd
What did we learn? f is a one-year citation predictor; F is a collection of one-year predictors, one per feature, describing how a paper's feature vector evolves from one year to the next.
[Figure: the linear transient model mapping a paper's features in 2000 (e.g., citation counts of the first and last authors, the paper's own citations in the last few years, citation counts of the citing papers) to the corresponding features in 2001.]
• 28. Citation Count Prediction, cont'd
How do we use the model? Given the feature vector of a paper s at year t = 2012, φ(s_2012):
 • citation count in 2013: $\hat c_1 = f^\top \phi(s_{2012})$;
 • citation count in 2014: we would need φ(s_2013), which is unavailable, so we predict the features, $\hat\phi_{2013} \stackrel{\text{def}}{=} F\,\phi(s_{2012})$, and apply f again: $\hat c_2 = f^\top \hat\phi_{2013}$.
Using a prediction to predict. This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and LAM-API to more general multi-step prediction problems. In the same way we can extrapolate further into the future.
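A Python sketch of this extrapolation; the feature vector passed in is an assumption for illustration:

import numpy as np

def predict_future_citations(F, f, phi_now, k=5):
    """Predict citation counts for the next k years by rolling the linear model forward:
    c_hat_1 = f^T phi, then phi <- F phi ("using a prediction to predict"), and so on."""
    phi = phi_now.copy()
    preds = []
    for _ in range(k):
        preds.append(f.dot(phi))   # predicted citations in the next year
        phi = F.dot(phi)           # predicted feature vector one more year ahead
    return preds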
• 29. Citation Count Prediction: Empirical
"Now" is 2002. Training data: the citation counts of 7K papers from 1990 to "Now". Test data: their citation counts from "Now" to 2012. Algorithms: LS/SVR, LSTD.
[Figure: least-squares predictions vs. true values. (g) LS on the training period; (h) LS predicting the future.]
• 30. Citation Count Prediction: Empirical
LSTD successfully generalizes over time for the training papers.
[Figure: (i) LSTD predictions vs. true values.]
• 31. Citation Count Prediction: Empirical
Predicting for test papers (newer than the training papers). Papers are color-coded by year of publication:
 • left plot — black, green, blue, red, magenta: papers published in 1990, 1991, 1992, 1993, 1994;
 • middle plot — black, green, blue, red, magenta: papers published in 1995, 1996, 1997, 1998, 1999;
 • right plot — black, green: papers published in 2000, 2001.
True citation counts are marked with crosses (+); predictions with stars (*).
LSTD successfully generalizes over time for new papers.
[Figure: LSTD(0) predictions vs. true values. (j) papers 1990–94; (k) papers 1995–99; (l) papers 2000–01.]
• 32. Citation Count Prediction: Empirical
Short-term prediction: the performance of the proposed method.
[Figure: (m) Dyna-style 4-year predictions vs. true values for papers published in 2000–01; (n) summary of RMSE vs. years into the future, with curves for papers published before 1990, in 1990–1995, in 1995–2000, and in 2000–2001.]
• 33. Thank You!