My RL Approach to Prediction and Control

Hengshuai Yao
University of Alberta

April 4, 2013



Outline

        • One-page Summary of my work (30 seconds)
        • Background on Reinforcement Learning (RL) (8 slides; 6 minutes)
        • A walkthrough of my work (6 slides; 4 minutes)
        • LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes)
        • Citation count prediction using RL (10 slides; 8 minutes)




Summary of my work
   Prediction
        • A framework for existing prediction algorithms [ICML 08]
        • Data efficiency for on-policy prediction (multi-step linear Dyna [NIPS 09]) and
          off-policy prediction (LAM-off-policy [ICML-workshop 10])
   Control
        • Memory and sample efficiency for control (LAM-API [AAAI 12])
        • Online abstract planning with Universal Option Models [in preparation for JAIR
          with Csaba, Rich and Shalabh]
        • RL with general function approximation; deterministic MDPs [in preparation for
          the Machine Learning Journal with Csaba]
   RL applications
        • Citation count prediction [submitted to IEEE Trans. on SMC-Part B]
        • Ranking [current work with Dale]


Background on RL


   I will go over
        • MDPs: Definition, Policies, Value Functions, and more
        • Prediction Problems (TD, Dyna, On-policy, Off-policy)
        • The Control Problem (Value Iteration, Q-learning, LSPI)




MDPs
   An MDP is defined by a 5-tuple $(\gamma, S, A, (P^a)_{a \in A}, (R^a)_{a \in A})$, where

   $$P^a(s'|s) = P_0(s'|s, a), \qquad R^a : S \times S \to \mathbb{R}.$$

   $(P^a)_{a \in A}$ and $(R^a)_{a \in A}$ are called the model, or the transition dynamics.
   A policy $\pi : S \times A \to [0, 1]$ selects actions at states. Think of a
   policy as a description of how you act every day.



My MDP example

   [Figure: a small travel MDP. From the University of Alberta (t=0), policy π flies
   to the Edmonton Airport with probability 1.0 (t=1, r=$-100), then to the HK Airport
   with probability 1.0 (t=2, r=$-1,000). From HK it goes to Noah with probability 0.9
   (t=3, r=$10,000) or to Paomadi with probability 0.1 (t=3, r = r_horse).]

   S: UofA, EDM, HK, Paomadi, Noah
   A: the set of the links
   (P^a)_{a ∈ A}: deterministic
   R^a(s, s') = r(s')
   π(UA, Edmonton) = 1, π(HK, Noah) = 0.9, π(HK, Paomadi) = 0.1; etc.
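   To make the example concrete, here is a minimal Python sketch that encodes the
   same toy MDP as plain dictionaries; the action names and the concrete value of
   r_horse are my own illustration, only the states, rewards, and probabilities
   come from the slide.

    # A minimal encoding of the example MDP (names and the concrete
    # r_horse value are illustrative, not from the thesis code).
    r_horse = -1000          # reward for ending up at Paomadi (horse racing)

    states = ["UofA", "EDM", "HK", "Paomadi", "Noah"]

    # P[state][action] -> list of (next_state, probability); the actions are the links.
    P = {
        "UofA": {"fly_EDM": [("EDM", 1.0)]},
        "EDM":  {"fly_HK":  [("HK", 1.0)]},
        "HK":   {"go_Noah": [("Noah", 1.0)], "go_Paomadi": [("Paomadi", 1.0)]},
    }

    # Rewards depend only on the next state: R^a(s, s') = r(s').
    r = {"UofA": 0, "EDM": -100, "HK": -1000, "Noah": 10000, "Paomadi": r_horse}

    # The policy from the slide: deterministic until HK, then 0.9 / 0.1.
    pi = {
        "UofA": {"fly_EDM": 1.0},
        "EDM":  {"fly_HK": 1.0},
        "HK":   {"go_Noah": 0.9, "go_Paomadi": 0.1},
    }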





Value function

   $$V^\pi(s) = E\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0 = s,\ a_t \sim \pi(s_t, \cdot)\Big]$$

   Optimal policy

   $$V^*(s) = V^{\pi^*}(s) = \max_{\pi} V^\pi(s), \quad \text{for all } s \in S.$$

   $$V^\pi(UofA) = -100 - 1000\gamma + \gamma^2 \big(0.1 \times (-1000) + 0.9 \times 10{,}000\big)$$

   If $r_{horse} = -1{,}000$:

   $$V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big(\underbrace{1.0}_{=\pi^*(HK,\,Noah)} \times 10{,}000\big)$$

   If $r_{horse} = 1{,}000{,}000$:

   $$V^*(UofA) = -100 - 1000\gamma + \gamma^2 \big(\underbrace{1.0}_{=\pi^*(HK,\,Paomadi)} \times 1{,}000{,}000\big)$$
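   A small sanity-check sketch that evaluates the closed-form expressions above for
   the example; γ is left as a free parameter and the function names are only
   illustrative.

    # V^pi(UofA) = -100 - 1000*gamma + gamma^2 * (0.1*r_horse + 0.9*10000)
    def v_uofa(gamma, r_horse):
        return -100.0 - 1000.0 * gamma + gamma ** 2 * (0.1 * r_horse + 0.9 * 10000.0)

    def v_star_uofa(gamma, r_horse):
        # The optimal policy puts all probability on the better HK action.
        return -100.0 - 1000.0 * gamma + gamma ** 2 * max(10000.0, r_horse)

    if __name__ == "__main__":
        for rh in (-1000.0, 1_000_000.0):
            print(rh, v_uofa(0.9, rh), v_star_uofa(0.9, rh))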

MDPs cont’–Dynamic programming
   Bellman equation. For all s ∈ S and any policy π, one-step look-ahead:

   $$V^\pi(s) = \bar{r}(s) + \gamma E_{S' \sim \pi(s, \cdot)}[V^\pi(S')],$$

   where $\bar{r}(s) = \sum_{s'} P^\pi(s, s')\, r(s, s')$ and $r(s, s') = \sum_{a \in A} P^a(s, s') R^a(s, s')$.

   Solving $V^\pi$ for a given policy π is called policy evaluation (simple: power iteration).
   Solving $V^{\pi^*}$ or $\pi^*$ is called control, usually via value iteration:

   $$V_{k+1}(s) = \max_a E[r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a]
                = \max_a \sum_{s'} P^a(s'|s)\big(R^a(s, s') + \gamma V_k(s')\big).$$

   Policy iteration is similar.
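   A minimal tabular value-iteration sketch of the update above; the dictionary-based
   model layout (P[s][a], R[s][a][s']) is an assumption for illustration, not code
   from the thesis.

    # Tabular value iteration:
    # V_{k+1}(s) = max_a sum_{s'} P^a(s'|s) * (R^a(s, s') + gamma * V_k(s')).
    def value_iteration(P, R, gamma=0.9, n_iters=100, tol=1e-8):
        """P[s][a] is a list of (s', prob); R[s][a][s'] is the reward. Returns V."""
        V = {s: 0.0 for s in P}
        for _ in range(n_iters):
            delta = 0.0
            for s, actions in P.items():
                backups = [
                    sum(prob * (R[s][a][sp] + gamma * V.get(sp, 0.0))
                        for sp, prob in outcomes)
                    for a, outcomes in actions.items()
                ]
                new_v = max(backups) if backups else 0.0
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:      # stop once the values have converged
                break
        return V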

RL
   Features of RL:
        • Sample-based learning. No model.
        • Only immediate rewards are observed.
        • Partially observable, e.g., citation count prediction.
   Prediction/Control: solving V^π (for some π) or V^* using samples.
   Sample efficiency and memory are important. Algorithms:
        • TD, Q-learning [Barto et al. 83; Sutton 88; Dayan 92; Bertsekas 96]
        • Dyna [Sutton et al. 91] and linear Dyna [Sutton et al. 08]
        • LSTD [Boyan 02], LSPI [Lagoudakis and Parr 03]
        • GTD [Sutton et al. 09; Maei et al. 10]


Prediction
   Feature mapping: φ : S → R^n, n being the number of features.
   Linear Function Approximation (LFA). We approximate V^π by

   $$\hat{V}^\pi(s) = \phi(s)^\top \theta,$$

   for s ∈ S, where θ is the parameter vector (to learn).
   Samples (this is our Big Data):

   $$D = \big( \langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle \big)_{t=1,2,\ldots,T}$$

   Prediction: Given an input policy π, output an estimate of the
   value function V^π. We learn a predictor on D using φ.
   On-policy: D is created by following π.
   Off-policy: D is not created by π.

Prediction-cont’-TD (Sutton, 88)
   Given the tuple ⟨φ(s_t), a_t, r_{t+1}, φ(s_{t+1})⟩, Temporal Difference
   (TD) learning (without eligibility traces):

   $$\delta_t = r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t).$$

   δ_t is called the TD error, which is a sample of the Bellman residual:

   $$E[\delta_t \mid s_t = s] = \bar{r}(s) + \gamma \sum_{s' \in S} P^\pi(s, s') \hat{V}^\pi(s') - \hat{V}^\pi(s).$$

   $$\Delta\theta_t \propto \alpha_t \delta_t \phi(s_t)$$
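   A minimal sketch of linear TD(0) implementing exactly these updates; the data
   layout (an iterable of (s, r, s') transitions) and the step size are assumptions
   made for illustration.

    import numpy as np

    def td0_linear(transitions, phi, n_features, alpha=0.01, gamma=0.99):
        """Linear TD(0): theta <- theta + alpha * delta * phi(s_t).

        `transitions` is an iterable of (s, r, s_next) tuples generated by
        following the policy being evaluated; `phi` maps a state to a vector
        of length n_features.
        """
        theta = np.zeros(n_features)
        for s, r, s_next in transitions:
            v_s = phi(s) @ theta
            v_next = phi(s_next) @ theta
            delta = r + gamma * v_next - v_s        # TD error
            theta += alpha * delta * phi(s)         # TD(0) update
        return theta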



Preconditioning Framework [ICML 08]

   Previously: issues of step-size, sample efficiency, and sparsity in
   LSTD [Boyan 02], LSPE [Bertsekas et al. 96, 03, 04], FPKF
   [Van Roy 06], and iLSTD [Geramifard et al. 06, 07].
   Contribution: I proposed a general class of algorithms by
   applying the preconditioning technique from iterative analysis;
   the class includes all of the algorithms above, and the framework
   addresses all of these issues. Empirical results: the step-size
   adaptation learns much more quickly, and sparsity-based storage and
   computation are more efficient.




Multi-step Linear Dyna [NIPS 09]
   Previously: Online planning is believed to speed up learning
   and to lead to better decisions (mostly in tabular settings), but "model-based
   is poorer than model-free". Dyna [Sutton et al. 92] / linear Dyna [Sutton
   et al. 08] is an integrated architecture for real-time acting, learning, modeling, and
   planning, with none of these waiting for the others to complete. However, linear Dyna
   was found to perform worse than (model-free) TD learning [Sutton et al. 08].
   Contribution: I improved linear Dyna [Sutton et al. 08] to
   perform much better than TD. I also extended linear Dyna from
   single-step to multi-step planning, and demonstrated on
   Mountain-car (an under-powered car climbing a mountain) that
   multi-step planning predicts more accurately than single-step planning.



LAM based off-policy learning
[ICML-workshop 10]
   Previously: Off-policy learning is ubiquitous. TD can diverge, but is
   reasonably fast when it converges. GTD algorithms [Sutton et al.
   09, 10, 11] converge but are slow.
   Contribution: I used a linear action model (LAM) based framework for
   off-policy learning. It can learn about various policies in parallel from
   a single stream of data, enabling quick real-time decision making.
   Evaluated on two continuous-state, hard control problems. I
   recommend using LAM based off-policy learning in place of
   on-policy learning.


Deterministic MDPs [with Csaba]
   Previously. Theory: state aggregation, LFA; Practice: LFA and
   neural networks.
   Contribution: A very general framework for RL with function
   approximation. We propose to view all RL methods as building
   some correspondence MDP, which has a smaller state space
   than the original. We solve the correspondence MDP and lift
   the policies and value functions found there back to the original MDP.
   A number of important results are proved (20+ lemmas and
   theorems). This framework is helpful for understanding existing
   algorithms as well as for developing new ones.



Reinforcement Ranking [with Dale]

   Does the Bellman equation look familiar? PageRank? Stationary
   distributions?
   Contribution: We proposed a framework for discovering
   authorities using rewards defined on pages/links. There is no stationary
   distribution at all, yet convergence is still guaranteed. Evaluation was
   performed on Wikipedia, DBLP, and WebSpamUK, comparing
   precision and recall with PageRank and TrustRank.
   Promising results on Wikipedia and DBLP.




Universal Option Models [with R.C.S]
   Previously: Options are used to describe high-level decisions. The execution of an
   option is a sequence of actions (abstraction). Traditional option models
   consist of a reward part and a state-prediction part. This is very
   inefficient with multiple reward functions (or when the reward function
   changes dynamically).
   Contribution: We proposed a new way of modeling
   options: removing the reward part but adding a state-occupancy
   part. We proved that (a) given any reward function,
   we can construct the return of the option from the new model;
   (b) with the new model we can recover the TD solution without
   any planning computation. On a simulated StarCraft 2 game, planning with the
   new model is much more efficient than with the traditional model.
   Very suitable for large real-time games.

LAM-API [AAAI 12]
   Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03])
   have to store the sample set D and sweep over all the samples in each iteration; D
   can be very large in practice.
   Key idea: Summarize your Big Data with a model. Work with the model.
   First, we learn a linear model ⟨F^a, f^a⟩ for each action a from the samples. For a
   given action a and any state s ∈ S, with s' ∼ P^a(s, ·), we expect that

   $$F^a \phi(s) \approx E[\phi(s')] \quad \text{and} \quad (f^a)^\top \phi(s) \approx E[R^a(s, s')].$$

   Complexity of modeling: O(T n²).
   Second, we use all the LAMs to perform API. Complexity: O(|L| n² N_iter).
   LAM-API: O(T n²) + O(|L| n² N_iter);   LSPI: O(T n² N_iter).
   Big Data: T ≫ |L|, which means
        O(T n²) (LAM-API) ≪ O(T n² N_iter) (LSPI).
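   One plausible way to fit the per-action models ⟨F^a, f^a⟩ from the sample set D is
   ridge-regularized least squares; the sketch below is my own illustration of that
   idea, not necessarily the estimator used in the paper.

    import numpy as np

    def fit_lams(samples, n_features, reg=1e-6):
        """Fit a linear action model <F^a, f^a> per action from samples.

        `samples` is an iterable of (phi_s, a, r, phi_next), with phi_* as
        vectors of length n_features. Returns dicts F[a] (n x n) and f[a] (n,)
        such that F[a] @ phi(s) ~ E[phi(s')] and f[a] @ phi(s) ~ E[R^a(s, s')].
        """
        by_action = {}
        for phi_s, a, r, phi_next in samples:
            by_action.setdefault(a, []).append((phi_s, r, phi_next))

        F, f = {}, {}
        for a, rows in by_action.items():
            X = np.array([p for p, _, _ in rows])        # T_a x n current features
            Y = np.array([pn for _, _, pn in rows])      # T_a x n next features
            R = np.array([r for _, r, _ in rows])        # T_a rewards
            G = X.T @ X + reg * np.eye(n_features)       # regularized Gram matrix
            F[a] = np.linalg.solve(G, X.T @ Y).T         # so that F[a] @ phi ~ phi'
            f[a] = np.linalg.solve(G, X.T @ R)
        return F, f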


LAM-API-cont’
   Algorithm 1: LAM-API with LSTD
   Input: a list of features L = {φ_i}, a LAM (⟨F^a, f^a⟩)_{a∈A}. Output: a weight vector θ.
   Initialize θ
   repeat for N_iter iterations:
       for φ_i in L do
           Select the greedy action:
               a* = argmax_a { (f^a)^⊤ φ_i + γ θ^⊤ F^a φ_i }
           Select the model:
               F* = F^{a*},   f* = f^{a*}
           Produce predictions of the next feature vector and reward:
               φ̃_{i+1} = F* φ_i
               r̃_i = (f*)^⊤ φ_i
           Accumulate the LSTD structures:
               A = A + φ_i (γ φ̃_{i+1} − φ_i)^⊤
               b = b + φ_i r̃_i
       θ = −A^{−1} b
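   A direct Python transcription of Algorithm 1, kept as a sketch: resetting A and b
   at every iteration and adding a small ridge term for invertibility are my own
   choices, not specified on the slide.

    import numpy as np

    def lam_api_lstd(L, F, f, gamma=0.99, n_iters=20, reg=1e-6):
        """Approximate policy iteration with a linear action model (Algorithm 1).

        L: list of feature vectors phi_i (each of shape (n,));
        F, f: dicts mapping each action a to F^a (n x n) and f^a (n,).
        Returns the weight vector theta of the greedy policy's value function.
        """
        n = len(L[0])
        theta = np.zeros(n)
        actions = list(F.keys())
        for _ in range(n_iters):
            A = reg * np.eye(n)                   # small ridge term for invertibility
            b = np.zeros(n)
            for phi in L:
                # Greedy action under the current theta, evaluated with the model.
                a_star = max(actions,
                             key=lambda a: f[a] @ phi + gamma * theta @ (F[a] @ phi))
                phi_next = F[a_star] @ phi        # predicted next feature vector
                r_hat = f[a_star] @ phi           # predicted reward
                A += np.outer(phi, gamma * phi_next - phi)
                b += phi * r_hat
            theta = -np.linalg.solve(A, b)        # theta = -A^{-1} b
        return theta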


LAM-API-cont’
   Compared learning quality with LSPI. L = {φ_i | φ_i ← D_i}.
   Chain-walk problems. Left: 4-state chain. Right: LAM-LSTD on the 50-state chain.
   LAM-LSTD converges in 4 iterations. At iteration 2, the policy is already optimal.

   [Figure: LAM-LSTD on the 50-state chain: value functions over states 1-50 at
   iterations 0 (all actions 'R'), 1, and 2, each labeled with the corresponding
   L/R policy; the iteration-2 policy is already optimal.]

LAM-API-cont’
   LSPI converges in 14 iterations; it finds the optimal policy only at the
   last iteration.

   [Figure: (c) LSPI iterations 0, 1, 7 and (d) LSPI iterations 8, 9, 14 on the
   50-state chain: value functions over states 1-50, each labeled with the
   corresponding L/R policy; iteration 0 takes all actions 'R', and the policy
   becomes optimal only at iteration 14.]

LAM-API-cont’-Cart-Pole
   Goal: keep the pendulum above horizontal (for a maximum of 3000 steps).
   Reward: binary. State: angle and angular velocity (both continuous).

   [Figure: (e) the cart-pole balancing task; (f) number of balanced steps vs. number
   of training episodes (0-1000), showing the best, average, and worst runs of
   LAM-LSTD and LSPI.]

   Why important? LSPI [Lagoudakis and Parr 03] is widely used; "LSPI is arguably the
   most competitive RL algorithm available in large environments." [Li et al. 09]

Citation Count Prediction
   Citation count: the most widely used measure in academia.
   Predicting it is interesting. We studied the prediction of
   citation counts for papers.
   Previously, [Yan et al. 11] and [Fu 08] studied citation count
   prediction using supervised learning (SL).
   Training (spatial):

                                Input → Output
           x: feature vectors in 1990 → y: citation counts until 2000

   Given a paper's features in 1990: predict.
   Now, given a paper's features in 2000: what then? (a temporal aspect)


Citation Count Prediction–cont’
   Citation count prediction is temporal.
   Problem formulation. Define the "value" of a paper p at year t as
   the sum of the discounted numbers of citations in all subsequent years:

   $$V(p, t) = \sum_{q=t}^{\infty} \gamma^{q-t} c_q, \qquad \gamma \in [0, 1),$$

   where c_q is the number of citations the paper receives in year q.
   When t is the publication year of the paper and γ approaches one, V(p, t) is
   virtually the total citation count of the paper.
   Question: What is my state here?  s_t = (p, t)
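   A small sketch that computes this discounted citation value from a finite citation
   history, truncating the infinite sum at the last observed year; the example counts
   are made up.

    def citation_value(citations_per_year, start_year, gamma=0.95):
        """V(p, t) = sum_{q >= t} gamma^(q - t) * c_q, truncated at the observed horizon.

        `citations_per_year` maps a year q to the citation count c_q of paper p.
        """
        return sum(gamma ** (q - start_year) * c
                   for q, c in citations_per_year.items() if q >= start_year)

    # Example (made-up counts): a paper cited 3, 5, and 8 times in 2000-2002.
    print(citation_value({2000: 3, 2001: 5, 2002: 8}, start_year=2000, gamma=0.95))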

Citation Count Prediction–cont’
   We represent the value by

   $$\hat{V}(p, t) = \phi(p, t)^\top \theta.$$

   Samples: a data set

   $$D = \bigcup_{p \in P} D_p; \qquad D_p = \big( \langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle \big)_{t=1990, 1991, \ldots, 2000}$$

   Features. φ(p, t) is a vector with entries for, e.g., the number
   of citations of each author up to year t, the number of citations of
   the venue up to year t, etc.

Citation Count Prediction–cont’
   Long-term: predicting more than 10 years ahead (TD and LSTD).
   Short-term: predicting within k (k < 10) years:
        • not a standard RL problem
        • We extended LAM to this context and proposed a model-based
          prediction method.

   Key idea: learn a model ⟨F, f⟩ from the year-to-year status changes of
   papers. Given ⟨φ(p, t), c_{t+1}, φ(p, t+1)⟩, update

   $$\Delta F = \alpha \big[ \phi(p, t+1) - F \phi(p, t) \big] \phi(p, t)^\top,$$

   and

   $$\Delta f = \alpha \big[ c_{t+1} - f^\top \phi(p, t) \big] \phi(p, t).$$

Citation Count Prediction–cont’
   What have we learned?
   f: a one-year predictor
   F: multiple one-year predictors

   [Figure: the linear transient model F maps a paper's feature vector in year 2000
   (#citations of the first author, #citations of the last author, the paper's own
   #citations in the last few years, #citations of the citing papers, ...) to the
   corresponding feature vector in year 2001.]


Citation Count Prediction–cont’
   How do we use the model?
   Given: the feature vector of a paper s in year t = 2012, φ(s_2012).
   Citation count in 2013: ĉ_1 = f^⊤ φ(s_2012).
   Citation count in 2014: we need φ(s_2013), which is unavailable. We can
   predict the features: φ̂_2013 := F φ(s_2012). Then we combine with f again to
   predict

   $$\hat{c}_2 = f^\top \hat{\phi}_{2013}.$$

   Using a prediction to predict.
   This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and
   LAM-API to more general multi-step prediction problems.
   Similarly, we can extrapolate further years into the future.
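   A sketch of this multi-step use of the model: each additional year applies F once
   more to the (predicted) feature vector, and f maps features to a one-year
   citation-count prediction; the function name and data layout are my own.

    import numpy as np

    def predict_future_citations(phi_now, F, f, n_years):
        """Return [c_hat_1, ..., c_hat_k]: predicted per-year citation counts.

        c_hat_1 = f . phi_now, c_hat_2 = f . (F phi_now), and so on, i.e.
        c_hat_k = f . F^{k-1} phi_now ("using a prediction to predict").
        """
        preds, phi = [], np.asarray(phi_now, dtype=float)
        for _ in range(n_years):
            preds.append(float(f @ phi))   # one-year prediction from current features
            phi = F @ phi                  # roll the features one year forward
        return preds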



Citation Count Prediction–Empirical
   "Now" is 2002.
   Training data: the citation counts of 7K papers from 1990 to "Now".
   Test data: their citation counts from "Now" to 2012.
   Algorithms: LS/SVR, LSTD.

   [Figure: (g) LS on the training data and (h) LS predicting the future;
   both plots show predicted vs. true citation values.]

Citation Count Prediction–Empirical
   LSTD successfully generalizes over time for the training papers.

   [Figure: (i) LSTD predictions vs. true values for the training papers.]

Citation Count Prediction–Empirical
   Predicting for test papers (newer than the training papers).
   Papers are marked according to their year of publication (black, green, blue, red, magenta):
        • in the left plot, papers published in 1990, 1991, 1992, 1993, 1994
        • in the middle plot, papers published in 1995, 1996, 1997, 1998, 1999
        • in the right plot, black and green correspond to papers published in 2000 and 2001
   True citation counts are marked with a cross (+); predictions with a star (*).
   LSTD successfully generalizes over time for new papers.

   [Figure: LSTD(0) predictions vs. true values for (j) papers from 1990-94,
   (k) papers from 1995-99, and (l) papers from 2000-01.]

Citation Count Prediction–Empirical
   Short-term prediction: the performance of the proposed method.

   [Figure: (m) Dyna-style 4-year prediction vs. true values for papers from 2000-01;
   (n) summary: RMSE vs. the number of years into the future (1-8) for papers
   published before 1990, in 1990-1995, in 1995-2000, and in 2000-2001.]

                                 Thank You!



                                     Hengshuai Yao     Thesis Overview   33/33

More Related Content

Similar to Hengshuai noah

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
A brief introduction to 'R' statistical package
A brief introduction to 'R' statistical packageA brief introduction to 'R' statistical package
A brief introduction to 'R' statistical package
Shanmukha S. Potti
 
Formal methods 5 - Pi calculus
Formal methods   5 - Pi calculusFormal methods   5 - Pi calculus
Formal methods 5 - Pi calculus
Vlad Patryshev
 
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemAlgorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
IRJET Journal
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
Illia Polosukhin
 
Query compiler
Query compilerQuery compiler
Query compiler
Digvijay Singh
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
DataWorks Summit
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
DataWorks Summit
 
Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.
Ian Sa
 
LFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdf
harinsrikanth
 
Personalised Search for the Social Semantic Web
Personalised Search for the Social Semantic WebPersonalised Search for the Social Semantic Web
Personalised Search for the Social Semantic Web
Oana Tifrea-Marciuska
 
relational algebra and it's implementation
relational algebra and it's implementationrelational algebra and it's implementation
relational algebra and it's implementation
dbmscse61
 

Similar to Hengshuai noah (12)

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
A brief introduction to 'R' statistical package
A brief introduction to 'R' statistical packageA brief introduction to 'R' statistical package
A brief introduction to 'R' statistical package
 
Formal methods 5 - Pi calculus
Formal methods   5 - Pi calculusFormal methods   5 - Pi calculus
Formal methods 5 - Pi calculus
 
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemAlgorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
Query compiler
Query compilerQuery compiler
Query compiler
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
 
Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.Preemptive RANSAC by David Nister.
Preemptive RANSAC by David Nister.
 
LFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdf
 
Personalised Search for the Social Semantic Web
Personalised Search for the Social Semantic WebPersonalised Search for the Social Semantic Web
Personalised Search for the Social Semantic Web
 
relational algebra and it's implementation
relational algebra and it's implementationrelational algebra and it's implementation
relational algebra and it's implementation
 

Hengshuai noah

  • 1. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL My RL Approach to Prediction and Control Hengshuai Yao University of Alberta April 4, 2013 Hengshuai Yao Thesis Overview 1/33
  • 2. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Outline • One-page Summary of my work (30 seconds) • Background on Reinforcement Learning (RL) (8 slides; 6 minutes) • Walkthrough my work (6 slides, 4 minutes) • LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes) • Citation count prediction using RL (10 slides; 8 minutes) Hengshuai Yao Thesis Overview 2/33
  • 3. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Summary of my work Prediction • A framework for existing prediction algorithms [ICML 08] • Data efficiency for on-policy prediction (Multi-step linear Dyna [NIPS 09]) and off-policy prediction (LAM-off-policy [ICML-workshop 10]) Control • Memory and sample efficiency for control (LAM-API [AAAI 12]) • Online abstract planning with Universal Option Models [in preparation for JAIR with Csaba, Rich and Shalabh] • RL with general function approximation. Deterministic MDPs [in preparation for Machine Learning Journal with Csaba] RL applications • Citation count prediction[submitted to IEEE Trans. on SMC-part B] • Ranking [current work with Dale] Hengshuai Yao Thesis Overview 3/33
  • 4. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Background on RL I will go over • MDPs: Definition, Policies, Value Functions, and more • Prediction Problems (TD, Dyna, On-policy, Off-policy) • The Control Problem (Value Iteration, Q-learning, LSPI) Hengshuai Yao Thesis Overview 4/33
  • 5. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL MDPs An MDP is defined by a 5-tuple, γ, S, A, (P a )a∈A , (Ra )a∈A . P a (s′ |s) = P0 (s′ |s, a) Ra : S × S → R (P a )a∈A and (Ra )a∈A are called the model or transition-dynamics. A policy, π : S × A → [0, 1], selects actions at states. Think about a policy as a way of how you act everyday. Hengshuai Yao Thesis Overview 5/33
  • 6. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL My MDP example t=0 policy π University of Alberta 1.0 S: UofA, EDM, HK, Paomadi, t=1,r=$-100 Noah Edmonton Airport A: the set of the links 1.0 t=3, r_{horse} (P a )a∈A : deterministic 0.1 Ra (s, s′ ) = r(s′ ), HK Airport π(U A, Edmonton) = 1, 1.0 t=2,r=$-1,000 0.9 π(HK, N oah) = 0.9, π(HK, P aomadi) = 0.1; etc. t=3, r=$10,000 Hengshuai Yao Thesis Overview 6/33
  • 7. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Value function ∞ V π (s) = E γ t rt+1 | s0 = s, at ∼ π(st , ·) t=0 Optimal policy ∗ V ∗ (s) = V π (s) = max V π (s), for all s ∈ S. π V π (U of A) = −100 − 1000γ + γ 2 (0.1 × (−1000) + 0.9 × 10, 000) If rhorse = −1, 000: V ∗ (U of A) = −100 − 1000γ + γ 2 ( 1.0 ×10, 000) =π ∗ (HK,N oah) rhorse = 1, 000, 000: V ∗ (U of A) = −100 − 1000γ + γ 2 ( 1.0 ×1, 000, 000) =π ∗ (HK,P aomadi) Hengshuai Yao Thesis Overview 7/33
  • 8. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL MDPs cont’–Dynamic programming Bellman equation. For all s ∈ S, for any policy π, one-step look-ahead: V π (s) = r(s) + γES ′ ∼π(s,·) [V π (S ′ )], ¯ where r(s) = s′ P π (s, s′ )r(s, s′ ); r(s, s′ ) = a∈A P a (s, s′ )Ra (s, s′ ). ¯ Solving V π for an ordinary policy π is called policy evaluation. Simple, power iteration. ∗ Solving V π or π ∗ is called control, usually using value iteration: Vk+1 (s) = max E[rt+1 + γVk (st+1 ) | st = s, at = a] a = max P a (s′ |s)(Ra (s, s′ ) + γVk (s′ )) a s′ Policy iteration is similar. Hengshuai Yao Thesis Overview 8/33
  • 9. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL RL Features of RL: • Sample-based learning. No model. • Only intermediate rewards are observed. • Partially observable, e.g., citation count prediction. Prediction/Control: solving V π (for some π) or V ∗ using samples. Sample efficiency and memory is important. Algorithms: • TD, Q-learning [Barto et. al. 83; Sutton 88; Dayan 92; Bertsekas 96] • Dyna [Sutton et. al. 91] and linear Dyna [Sutton el.al.08]. • LSTD [Boyan 02], LSPI [Lagoudakis and Parr, 03] • GTD [Sutton et. al. 09; Maie 10 el. al. 10] Hengshuai Yao Thesis Overview 9/33
  • 10. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Prediction Feature mapping: φ : S → Rn , n being the number of features. Linear Function Approximation (LFA). We approximate V π by ˆ V π (s) = φ(s)⊤ θ, for s ∈ S, where θ is the parameter vector (to learn). Samples (this is our Big Data) D = ( φ(st ), at , rt+1 , φ(st+1 ) )t=1,2,...,T Prediction: Given an input policy π, output an estimate of the value function V π . We learn a predictor on D using φ. On-policy: D is created by following π. Off-policy: D is not created by π. Hengshuai Yao Thesis Overview 10/33
  • 11. Outline One-page Summary of my work Background MDPs A Walkthrough of My work RL Prediction LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Prediction-cont’-TD (Sutton, 88) Given the tuple φ(st ), at , rt+1 , φ(st+1 ) , Temporal Difference (TD) learning (without eligibility trace): ˆ ˆ δt = rt+1 + γ V (st+1 ) − V (st ), δt is called the TD error, which is a sample of the Bellman residual: E[δt |st = s] = r (s) + γ ¯ ˆ ˆ P π (s, s′ )V π (s′ ) − V π (s). s′ ∈S ∆θt ∝ αt δt φ(st ) Hengshuai Yao Thesis Overview 11/33
  • 12. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Preconditioning Framework [ICML 08] Previously: Issues of step-size, sample efficiency, and sparsity: LSTD [Boyan 02], LSPE [Bertsekas et. al. 96, 03, 04], FPKF [Van Roy 06], iLSTD [Geramifard et. al. 06, 07]. Contribution: I proposed a general class of algorithms by applying the preconditioning technique in iterative analysis, which includes the above mentioned algorithms. I solved all these issues in this framework. Empirical results: the step-size adaptation learns much quicker; sparsity based storage and computation is more efficient. Hengshuai Yao Thesis Overview 12/33
  • 13. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Multi-step Linear Dyna [NIPS 09] Previously: Online planning is believed to speed up learning and makes better decisions (mostly tabular), but “Model-based is poorer than model-free”. Dyna [Sutton et. al. 92]/linear Dyna [Sutton et.al.08] is an integrated architecture for real time acting, learning, modeling, and planning without waiting for each other to complete. However, linear Dyna was found to perform inferior to (model-free) TD learning [Sutton et.al.08]. Contribution: I improved linear Dyna [Sutton et.al. 08] to perform much better than TD. I also extended linear Dyna from single-step to multi-step planning, and demonstrated on Mountain-car (an under-powered car climbing a mountain) that multi-step planning predicts more accurately than single-step. Hengshuai Yao Thesis Overview 13/33
  • 14. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL LAM based off-policy learning [ICML-workshop 10] Previously: Off-policy learning is ubiquitous. TD diverges but is reasonably fast if it converges. GTD algorithms [Sutton et. al. 09,10,11] converge but are slow. Contribution: I used linear action model based framework for off-policy learning. It can learn various policies in parallel from a single stream of data, for quick real time decision making. Evaluated on two continuous-state, hard control problems. I recommend using LAM based off-policy learning in place of on-policy learning. Hengshuai Yao Thesis Overview 14/33
  • 15. Outline One-page Summary of my work Background A Walkthrough of My work LAM-API: Large-scale Off-policy Learning and Control Citation Count Prediction using RL Deterministic MDPs[with Csaba] Previously. Theory: state aggregation, LFA ; Practice: LFA, and neural network Contribution: A very general framework for RL with function approximation. We propose to view all RL methods as building some correspondence MDP, which has a smaller state space than the original. We solve the correspondence MDP and lift the policies and value functions found there back to the original. A few important results are proved (20+ lemmas and theorems). This framework is helpful in understanding existing algorithms as well as developing new ones. Hengshuai Yao Thesis Overview 15/33
• 16. Reinforcement Ranking [with Dale]
Does the Bellman equation look familiar? PageRank? Stationary distribution?
Contribution: we proposed a framework for discovering authorities using rewards defined on pages/links. No stationary distribution is involved, yet convergence is still guaranteed. Evaluated on Wikipedia, DBLP, and WebSpamUK; precision and recall compared against PageRank and TrustRank, with promising results on Wikipedia and DBLP.
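A rough Python sketch of the kind of reward-driven, Bellman-style scoring this suggests. Everything here is my own reconstruction under assumptions: a page-reward vector r, a link-weight matrix W, discounting gamma < 1 so the iteration is a contraction, and score flowing from linking pages to linked pages; the actual Reinforcement Ranking formulation may differ in the direction and normalization:

import numpy as np

def reward_rank(W, r, gamma=0.9, iters=100):
    """Iterate v <- r + gamma * W^T v to a fixed point.
    W[i, j]: weight of the link from page i to page j (no row-stochasticity,
    hence no stationary distribution); r[i]: reward assigned to page i."""
    v = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        v = r + gamma * W.T.dot(v)
    return v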
• 17. Universal Option Models [with R.C.S.]
Previously: options describe high-level decisions; executing an option yields a sequence of actions (temporal abstraction). Traditional option models consist of a reward part and a state-prediction part, which is very inefficient when there are multiple reward functions (or the reward function changes dynamically).
Contribution: we proposed a new way of modeling options: removing the reward part and adding a state-occupancy part. We proved that (a) given any reward function, the return of the option can be constructed from the new model, and (b) with the new model the TD solution can be recovered without any planning computation. On a simulated StarCraft 2 game the new model is much more efficient for planning than the traditional one, making it well suited to large real-time games.
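In rough terms (my paraphrase, with linear features and rewards assumed linear in the features, $r(s) \approx w^\top \phi(s)$): a universal option model keeps a state-prediction part $M^o$ and a discounted state-occupancy part $U^o$, from which the option's reward model for any reward weight vector $w$ can be read off without re-planning:
$$U^o \phi(s) \approx \mathbb{E}\Big[\textstyle\sum_{t=0}^{\tau-1} \gamma^t \phi(s_t) \,\Big|\, s_0 = s,\ \text{follow } o\Big], \qquad
M^o \phi(s) \approx \mathbb{E}\big[\gamma^{\tau} \phi(s_\tau) \,\big|\, s_0 = s\big], \qquad
R^o_w(s) \approx w^\top U^o \phi(s),$$
where $\tau$ is the (random) termination time of the option.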
• 18. LAM-API [AAAI 12]
Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03]) have to store the sample set D and sweep over all the samples in every iteration; D can be very large in practice.
Key idea: summarize your big data with a model, then work with the model.
First, learn a linear model $(F^a, f^a)$ for each action $a$ from the samples: for any state $s \in S$ with $s' \sim P^a(s, \cdot)$, we want $F^a \phi(s) \approx \mathbb{E}[\phi(s')]$ and $(f^a)^\top \phi(s) \approx \mathbb{E}[R^a(s, s')]$. Modeling cost: $O(Tn^2)$.
Second, use all the LAMs to perform API over a feature list $L$. Cost: $O(|L|\, n^2 N_{\text{iter}})$.
Totals: LAM-API $O(Tn^2) + O(|L|\, n^2 N_{\text{iter}})$ vs. LSPI $O(T n^2 N_{\text{iter}})$.
Big data means $T \gg |L|$, so LAM-API's $O(Tn^2)$ is far smaller than LSPI's $O(Tn^2 N_{\text{iter}})$.
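A hypothetical instance to make the comparison concrete (the numbers are illustrative assumptions, not from the experiments): with $T = 10^6$ samples, $|L| = 10^3$ features, $n = 100$, and $N_{\text{iter}} = 20$,
$$\text{LAM-API: } T n^2 + |L|\, n^2 N_{\text{iter}} = 10^{10} + 2\times 10^{8}, \qquad \text{LSPI: } T n^2 N_{\text{iter}} = 2\times 10^{11},$$
roughly an $N_{\text{iter}}$-fold (about 20x) saving, dominated by the single modeling pass over the data.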
• 19. LAM-API, cont'd
Algorithm 1: LAM-API with LSTD
  Input: a list of features L = {φ_i}, a LAM (F^a, f^a)_{a∈A}.
  Output: a weight vector θ.
  Initialize θ.
  Repeat for N_iter iterations:
    for each φ_i in L:
      Select the greedy action:  a* = argmax_a { (f^a)^⊤ φ_i + γ θ^⊤ F^a φ_i }
      Select the model:  F* = F^{a*},  f* = f^{a*}
      Predict the next feature vector and reward:  φ̃_{i+1} = F* φ_i,  r̃_i = (f*)^⊤ φ_i
      Accumulate the LSTD structures:  A ← A + φ_i (γ φ̃_{i+1} − φ_i)^⊤,  b ← b + φ_i r̃_i
    θ = −A^{-1} b
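A compact Python (NumPy) rendering of Algorithm 1 as I read it; resetting the LSTD accumulators at each policy-evaluation sweep and the small ridge term for invertibility are my assumptions, not part of the slide:

import numpy as np

def lam_api_lstd(features, F, f, gamma=0.99, n_iter=20, reg=1e-6):
    """features: list of feature vectors phi_i; F[a], f[a]: linear model of action a."""
    n = features[0].shape[0]
    theta = np.zeros(n)
    actions = list(F.keys())
    for _ in range(n_iter):
        A = reg * np.eye(n)   # reset accumulators each sweep (assumption); ridge for stability
        b = np.zeros(n)
        for phi in features:
            # greedy action under the current value estimate
            a_star = max(actions,
                         key=lambda a: f[a].dot(phi) + gamma * theta.dot(F[a].dot(phi)))
            phi_next = F[a_star].dot(phi)   # model-predicted next feature vector
            r = f[a_star].dot(phi)          # model-predicted reward
            A += np.outer(phi, gamma * phi_next - phi)
            b += phi * r
        theta = -np.linalg.solve(A, b)      # theta = -A^{-1} b, as in Algorithm 1
    return theta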
• 20. LAM-API, cont'd
Compared learning quality with LSPI. L = {φ_i | φ_i ← D_i}. Chain-walk problems. Left: the 4-state chain. Right: LAM-LSTD on the 50-state chain.
LAM-LSTD converges in 4 iterations; at iteration 2 the policy is already optimal.
[Figure: value functions on the 50-state chain.
 Iteration 0: all actions are 'R'.
 Iteration 1: RRRRRRRRRRRLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLL
 Iteration 2: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
• 21. LAM-API, cont'd
LSPI converges in 14 iterations and finds the optimal policy only at the last iteration.
[Figure: LSPI value functions and policies on the 50-state chain. (c) iterations 0, 1, 7; (d) iterations 8, 9, 14.
 Iteration 0: all actions are 'R'.
 Iteration 1: LLLLLRRRRRRLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLR
 Iteration 7: LRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLRR
 Iteration 8: RRRRLLRRRRRRLLLLLLLRRRLLLLLRRRRRRLLLLRRRRRRRLLLLLL
 Iteration 9: LRRRRRRRRRRLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLLLR
 Iteration 14: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
• 22. LAM-API, cont'd: Cart-Pole
Goal: keep the pendulum above horizontal (for a maximum of 3000 steps). Reward: binary. State: angle and angular velocity (both continuous).
[Figure: (e) cart-pole balancing; (f) number of balanced steps vs. number of training episodes, with curves for LSPI (average, worst), LAM-LSTD (average, worst), and the best of LAM-LSTD/LSPI.]
Why is this important? LSPI [Lagoudakis and Parr 03] is widely used; "LSPI is arguably the most competitive RL algorithm available in large environments." [Li et al. 09]
• 23. Citation Count Prediction
Citation count is the most widely used measure in academia; predicting it is interesting. We studied predicting the citation counts of papers.
Previously, [Yan et al. 11] and [Fu 08] studied citation count prediction using supervised learning (SL).
Training (spatial): input x = feature vectors in 1990 → output y = citation counts until 2000.
Given a paper's features in 1990, we can predict. But given a paper's features in 2000, what then? (a temporal aspect)
• 24. Citation Count Prediction, cont'd
Citation count prediction is temporal.
Problem formulation: define the "value" of a paper p at year t as the discounted sum of its citation counts over all subsequent years,
$$V(p, t) = \sum_{q=t}^{\infty} \gamma^{\,q-t}\, c_q, \qquad \gamma \in [0, 1),$$
where c_q is the number of citations the paper receives in year q. When t is the publication year and γ approaches one, V(p, t) is approximately the paper's total citation count.
Question: what is the state here? s_t = (p, t).
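A small Python check of the definition; the yearly counts below are made-up numbers:

def citation_value(counts, gamma=0.95):
    """V(p, t) = sum_{q >= t} gamma^(q - t) * c_q, with counts = [c_t, c_{t+1}, ...]."""
    return sum(gamma ** i * c for i, c in enumerate(counts))

# e.g., a paper cited 2, 5, 8, 6, 4 times in its first five years:
print(citation_value([2, 5, 8, 6, 4]))                # discounted value from year t
print(citation_value([2, 5, 8, 6, 4], gamma=0.999))   # close to the raw total (25) as gamma -> 1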
• 25. Citation Count Prediction, cont'd
We use a linear representation, $\hat V(p, t) = \phi(p, t)^\top \theta$.
Samples: a data set $D = \bigcup_{p \in P} D_p$, where $D_p = \big(\langle \phi(p,t),\, c_{t+1},\, \phi(p,t+1) \rangle\big)_{t = 1990, 1991, \ldots, 2000}$.
Features: φ(p, t) is a vector with entries for, e.g., the number of citations of each author up to year t, the number of citations of the venue up to year t, etc.
• 26. Citation Count Prediction, cont'd
Long-term prediction (more than 10 years ahead): TD and LSTD.
Short-term prediction (k < 10 years ahead):
 • not a standard RL problem;
 • we extended LAM to this context and proposed a model-based prediction method.
Key idea: learn a model (F, f) from the year-to-year status changes of papers. Given $\langle \phi(p,t),\, c_{t+1},\, \phi(p,t+1) \rangle$, update
$$\Delta F = \alpha\,\big[\phi(p,t+1) - F\phi(p,t)\big]\,\phi(p,t)^\top, \qquad \Delta f = \alpha\,\big[c_{t+1} - f^\top \phi(p,t)\big]\,\phi(p,t).$$
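The same updates written as a short Python sketch; the feature construction and the step size are assumptions for illustration:

import numpy as np

def update_citation_model(F, f, phi_t, c_next, phi_next, alpha=0.01):
    """One stochastic-gradient step on the year-to-year model.
    F: feature-transition matrix; f: one-year citation predictor."""
    F += alpha * np.outer(phi_next - F.dot(phi_t), phi_t)
    f += alpha * (c_next - f.dot(phi_t)) * phi_t
    return F, f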
• 27. Citation Count Prediction, cont'd
What did we learn? f is a one-year citation predictor; F is a collection of one-year predictors, one per feature, describing how a paper's feature vector evolves from one year to the next.
[Figure: the linear transient model mapping a paper's features in 2000 (e.g., citation counts of the first and last authors, the paper's own citations in the last few years, citation counts of the citing papers) to the corresponding features in 2001.]
• 28. Citation Count Prediction, cont'd
How do we use the model? Given the feature vector of a paper s at year t = 2012, φ(s_2012):
 • citation count in 2013: $\hat c_1 = f^\top \phi(s_{2012})$;
 • citation count in 2014: we would need φ(s_2013), which is unavailable, so we predict the features, $\hat\phi_{2013} \stackrel{\text{def}}{=} F\,\phi(s_{2012})$, and apply f again: $\hat c_2 = f^\top \hat\phi_{2013}$.
Using a prediction to predict. This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and LAM-API to more general multi-step prediction problems. In the same way we can extrapolate further into the future.
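A Python sketch of this extrapolation; the feature vector passed in is an assumption for illustration:

import numpy as np

def predict_future_citations(F, f, phi_now, k=5):
    """Predict citation counts for the next k years by rolling the linear model forward:
    c_hat_1 = f^T phi, then phi <- F phi ("using a prediction to predict"), and so on."""
    phi = phi_now.copy()
    preds = []
    for _ in range(k):
        preds.append(f.dot(phi))   # predicted citations in the next year
        phi = F.dot(phi)           # predicted feature vector one more year ahead
    return preds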
• 29. Citation Count Prediction: Empirical
"Now" is 2002. Training data: the citation counts of 7K papers from 1990 to "Now". Test data: their citation counts from "Now" to 2012. Algorithms: LS/SVR, LSTD.
[Figure: least-squares predictions vs. true values. (g) LS on the training period; (h) LS predicting the future.]
• 30. Citation Count Prediction: Empirical
LSTD successfully generalizes over time for the training papers.
[Figure: (i) LSTD predictions vs. true values.]
• 31. Citation Count Prediction: Empirical
Predicting for test papers (newer than the training papers). Papers are color-coded by year of publication:
 • left plot — black, green, blue, red, magenta: papers published in 1990, 1991, 1992, 1993, 1994;
 • middle plot — black, green, blue, red, magenta: papers published in 1995, 1996, 1997, 1998, 1999;
 • right plot — black, green: papers published in 2000, 2001.
True citation counts are marked with crosses (+); predictions with stars (*).
LSTD successfully generalizes over time for new papers.
[Figure: LSTD(0) predictions vs. true values. (j) papers 1990–94; (k) papers 1995–99; (l) papers 2000–01.]
• 32. Citation Count Prediction: Empirical
Short-term prediction: the performance of the proposed method.
[Figure: (m) Dyna-style 4-year predictions vs. true values for papers published in 2000–01; (n) summary of RMSE vs. years into the future, with curves for papers published before 1990, in 1990–1995, in 1995–2000, and in 2000–2001.]
• 33. Thank You!