Towards a Theoretical Framework for LCS

Jan Drugowitsch
Dr Alwyn Barry
Alwyn Barry introduces the theoretical framework for LCS that Jan Drugowitsch is currently working on.

  1. Towards a Theoretical Framework for LCS
     Jan Drugowitsch, Dr Alwyn Barry
     May 2006, © A.M. Barry and J. Drugowitsch, 2006
  2. Overview
     Big Question: Why use LCS?
     1. Aims and Approach
     2. Function Approximation
     3. Dynamic Programming
     4. Classifier Replacement
     5. Future Work
  3. Research Aims
     - To establish a formal model for accuracy-based LCS
     - To use established models, methods and notation
     - To transfer method knowledge (analysis / improvement)
     - To create links to other ML approaches and disciplines
  4. Approach
     - Reduce LCS to components:
       - Function Approximation
       - Dynamic Programming
       - Classifier Replacement
     - Model the components
     - Model their interactions
     - Verify models by empirical studies
     - Extend models from related domains
  5. Function Approximation: Architecture Outline
     - Set of K classifiers, classifier k matching the states S_k ⊆ S
     - Every classifier k approximates the value function V over S_k independently
     - Assumptions:
       - The function V : S → ℝ is time-invariant
       - The set of classifiers {S_1, …, S_K} is time-invariant
       - Fixed sampling distribution Π(S), given by π : S → [0, 1]
     - Each classifier minimises the MSE
           \sum_{i \in S_k} \pi(i) \big( V(i) - \tilde{V}_k(i) \big)^2
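To make the minimisation target on this slide concrete, here is a minimal numpy sketch (the state space, value function, sampling distribution and matched set are invented for illustration) that evaluates the π-weighted MSE of a single classifier over the states it matches:

    import numpy as np

    # Illustrative setup: N states, a target value function V, a sampling
    # distribution pi, and one classifier matching the subset S_k of states.
    N = 10
    V = np.sin(np.linspace(0.0, np.pi, N))      # target value function V(i)
    pi = np.full(N, 1.0 / N)                    # uniform sampling distribution pi(i)
    S_k = np.arange(0, 5)                       # states matched by classifier k

    def weighted_mse(V_k_hat, V, pi, S_k):
        """pi-weighted MSE of classifier k over its matched states S_k."""
        err = V[S_k] - V_k_hat[S_k]
        return np.sum(pi[S_k] * err ** 2)

    # An averaging classifier predicts a single constant over S_k; the
    # pi-weighted mean of V over S_k minimises the weighted MSE.
    const = np.sum(pi[S_k] * V[S_k]) / np.sum(pi[S_k])
    V_k_hat = np.full(N, const)
    print(weighted_mse(V_k_hat, V, pi, S_k))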
  6. Function Approximation: Linear Approximation Architecture
     - Feature vector φ : S → ℝ^L
     - Approximation parameter vector ω_k ∈ ℝ^L
     - Approximation \tilde{V}_k(i) = \phi(i)^T \omega_k
     - Averaging classifiers: φ(i) = (1), ∀i ∈ S
     - Linear estimators: φ(i) = (1, i)′, ∀i ∈ S
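The two feature choices named on this slide can be spelled out in a few lines; a small illustrative sketch (parameter values are invented) of the averaging and linear feature maps and the resulting prediction φ(i)ᵀω_k:

    import numpy as np

    def phi_averaging(i):
        """Averaging classifier: constant feature, phi(i) = (1)."""
        return np.array([1.0])

    def phi_linear(i):
        """Linear estimator: phi(i) = (1, i)'."""
        return np.array([1.0, float(i)])

    # Prediction of classifier k for state i: V~_k(i) = phi(i)^T omega_k
    omega_avg = np.array([0.7])           # a single constant estimate
    omega_lin = np.array([0.2, 0.05])     # intercept and slope

    i = 4
    print(phi_averaging(i) @ omega_avg)   # constant prediction 0.7
    print(phi_linear(i) @ omega_lin)      # 0.2 + 0.05 * 4 = 0.4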
  7. Function Approximation: Approximation to the MSE
     - Approximation to the MSE for classifier k at time t:
           \varepsilon_{k,t} = \frac{1}{c_{k,t}} \sum_{m=0}^{t} I_{S_k}(i_m) \big( V(i_m) - \omega_{k,t}' \phi(i_m) \big)^2
     - Matching function: I_{S_k} : S → {0, 1}
     - Match count: c_{k,t} = \sum_{m=0}^{t} I_{S_k}(i_m)
     - Since the sequence of states i_m depends on π, the MSE is automatically weighted by this distribution
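A minimal incremental sketch of the quantities defined on this slide, the match count c_{k,t} and the match-count-normalised squared error of an averaging classifier; the environment and parameter values are illustrative, and the normalisation follows the reconstruction above:

    import numpy as np

    rng = np.random.default_rng(0)

    N = 10
    V = np.linspace(0.0, 1.0, N)     # target value function
    S_k = set(range(0, 5))           # states matched by classifier k
    omega_k = np.array([0.5])        # averaging classifier, phi(i) = (1)

    c_k = 0       # match count c_{k,t}
    sse_k = 0.0   # running sum of matched squared errors

    for t in range(1000):
        i_m = rng.integers(N)                      # state sampled from pi (uniform here)
        if i_m in S_k:                             # I_{S_k}(i_m) = 1
            c_k += 1
            sse_k += (V[i_m] - omega_k[0]) ** 2

    eps_k = sse_k / c_k   # approximated MSE, normalised by the match count
    print(c_k, eps_k)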
  8. Function Approximation: Mixing Classifiers
     - A classifier k covers only S_k ⊆ S, so the classifier estimates need to be mixed to recover Ṽ(i)
     - This requires a mixing parameter (which will turn out to be 'accuracy'; see later)
           \tilde{V}(i) = \sum_{k=1}^{K} \psi_k(i) \, \omega_k' \phi(i),
           where \psi_k : S → [0, 1] and \sum_{k=1}^{K} \psi_k(i) = 1
     - ψ_k(i) = 0 where i ∉ S_k, so classifiers that do not match a state do not contribute to its estimate
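A short sketch of the mixed prediction Ṽ(i) = Σ_k ψ_k(i) ω_k′φ(i) for averaging classifiers, using uniform mixing over the matching classifiers purely as a placeholder (the accuracy-based choice of ψ_k comes later); all sets and values are invented:

    import numpy as np

    # Three averaging classifiers over a 1-D state space 0..9:
    # their matched sets and constant estimates omega_k.
    matched = [set(range(0, 5)), set(range(5, 10)), set(range(0, 10))]
    omega = [np.array([0.2]), np.array([0.8]), np.array([0.5])]

    def uniform_mixing(i):
        """Equal weight to every matching classifier, zero to the rest."""
        match = np.array([1.0 if i in S_k else 0.0 for S_k in matched])
        return match / match.sum()

    def mix(i, psi):
        """Mixed estimate V~(i) = sum_k psi_k(i) * omega_k' phi(i), with phi(i) = (1)."""
        return sum(psi[k] * omega[k][0] for k in range(len(omega)))

    i = 3
    psi = uniform_mixing(i)
    print(psi, mix(i, psi))   # classifier 2 (covering 5..9) gets weight 0 at state 3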
  9. Function Approximation: Matrix Notation
     - V = (V(1), …, V(N))′
     - \tilde{V}_k = \Phi \omega_k, where the feature matrix \Phi has rows \phi(1)′, …, \phi(N)′
     - \Psi_k = diag(\psi_k(1), …, \psi_k(N))
     - D = diag(\pi(1), …, \pi(N))
     - I_{S_k} = diag(I_{S_k}(1), …, I_{S_k}(N)) and D_k = I_{S_k} D
     - Mixed estimate: \tilde{V} = \sum_{k=1}^{K} \Psi_k \tilde{V}_k = \sum_{k=1}^{K} \Psi_k \Phi \omega_k
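The matrix form can be built directly in numpy; a small sketch (two hand-picked classifiers, invented parameters and an equal-split mixing rule) that assembles Φ, D, I_{S_k}, D_k and Ψ_k and evaluates Ṽ = Σ_k Ψ_k Φ ω_k:

    import numpy as np

    N = 6
    states = np.arange(N)

    # Feature matrix Phi whose rows are phi(i)' = (1, i)  (linear estimators)
    Phi = np.column_stack([np.ones(N), states.astype(float)])

    pi = np.full(N, 1.0 / N)
    D = np.diag(pi)                                # sampling-distribution matrix D

    # Two classifiers: diagonal matching matrices I_{S_k} and parameters omega_k
    I_S = [np.diag((states < 3).astype(float)),    # classifier 1 matches states 0..2
           np.diag((states >= 2).astype(float))]   # classifier 2 matches states 2..5
    D_k = [I @ D for I in I_S]                     # D_k = I_{S_k} D
    omega = [np.array([0.1, 0.2]), np.array([1.0, -0.1])]

    # Mixing matrices Psi_k: here an equal split among the classifiers matching each state
    match = np.column_stack([np.diag(I) for I in I_S])
    psi = match / match.sum(axis=1, keepdims=True)
    Psi = [np.diag(psi[:, k]) for k in range(len(I_S))]

    # Mixed estimate V~ = sum_k Psi_k Phi omega_k
    V_tilde = sum(Psi[k] @ Phi @ omega[k] for k in range(len(I_S)))
    print(V_tilde)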
  10. Function Approximation: Achievements
      - Proof that the MSE of a classifier k at time t is the normalised sample variance of the value function over the states matched up to time t
      - Proof that the expression developed for the sample-based approximated MSE is a suitable approximation to the actual MSE, for finite state spaces
      - Proof of optimality of the approximated MSE from the vector-space viewpoint (relates the classifier error to the Principle of Orthogonality)
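As a quick numerical illustration of the first point (not a proof): for an averaging classifier, the match-count-normalised MSE at the optimal constant equals the sample variance of V over the matched states. The values below are invented:

    import numpy as np

    V_matched = np.array([0.3, 0.7, 1.1, 0.2, 0.9])   # V at the states matched so far
    omega = V_matched.mean()                          # MSE-minimising constant estimate
    mse = np.mean((V_matched - omega) ** 2)           # match-count-normalised MSE
    print(mse, np.var(V_matched))                     # identical: the sample variance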
  11. Function Approximation: Achievements (cont'd)
      - Elaboration of algorithms using the model:
        - Steepest Gradient Descent
        - Least Mean Square (LMS): show that the sensitivity to step size and input range also holds in LCS
        - Normalised LMS
        - Kalman Filter, and its inverse covariance form:
          - Equivalence to Recursive Least Squares
          - Derivation of a computationally cheap implementation
      - For each, derivation of:
        - Stability criteria
        - For LMS, identification of the excess mean squared estimation error
        - Time constant bounds
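As one example from the list above, a hedged sketch of an LMS (gradient-descent) weight update performed by a single classifier on matched states only; the step size, features and target function are illustrative and this is not claimed to be the exact formulation analysed in the framework:

    import numpy as np

    rng = np.random.default_rng(1)

    def phi(i):
        return np.array([1.0, float(i)])   # linear-estimator features (1, i)'

    def V(i):
        return 0.3 + 0.1 * i               # target value function on the matched states

    omega_k = np.zeros(2)   # classifier k's parameter vector
    eta = 0.01              # LMS step size; stability depends on it and on the input range
    S_k = range(0, 5)       # states matched by classifier k

    for t in range(5000):
        i = rng.integers(0, 10)
        if i in S_k:                        # only matched states update the classifier
            x = phi(i)
            err = V(i) - omega_k @ x        # prediction error
            omega_k += eta * err * x        # LMS step along the local gradient

    print(omega_k)   # approaches (0.3, 0.1)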
  12. Function Approximation: Achievements (cont'd)
      - Investigation: (figures; not preserved in the transcript)
  13. Function Approximation: How to set the mixing weights ψ_k(i)?
      - Accuracy-based mixing (YCS-like!)
      - Divorces mixing for function approximation from fitness
            \psi_k(i) = \frac{ I_{S_k}(i) \, \varepsilon_k^{-\nu} }{ \sum_{p=1}^{K} I_{S_p}(i) \, \varepsilon_p^{-\nu} }
      - Investigating the power factor ν ∈ ℝ⁺
      - Demonstrate, using the model, that we obtain the MLE of the mixed classifiers when ν = 1
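The mixing rule on this slide translates directly into code; a small sketch (invented matched sets and error estimates) that computes ψ_k(i) ∝ I_{S_k}(i) ε_k^{-ν} and shows the effect of ν:

    import numpy as np

    def mixing_weights(i, matched, eps, nu=1.0):
        """Accuracy-based mixing: psi_k(i) proportional to I_{S_k}(i) * eps_k^(-nu)."""
        match = np.array([1.0 if i in S_k else 0.0 for S_k in matched])
        w = match * np.asarray(eps) ** (-nu)
        return w / w.sum()

    matched = [set(range(0, 5)), set(range(5, 10)), set(range(0, 10))]
    eps = [0.01, 0.04, 0.20]   # estimated MSEs; lower error gives higher weight

    print(mixing_weights(3, matched, eps, nu=1.0))
    print(mixing_weights(3, matched, eps, nu=5.0))   # higher nu favours the low-error classifier even more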
  14. Function Approximation: Mixing, Empirical Results
      - Higher ν gives more weight to low-error classifiers
      - Found ν = 1 to be the best compromise
      - Test setups: classifiers covering 0..π/2, π/2..π, 0..π and classifiers covering 0..0.5, 0.5..1, 0..1 (result plots not preserved in the transcript)
  15. Dynamic Programming
      - Optimal value function:
            V^*(i) = \max_\mu E\left[ \sum_{t=0}^{\infty} \gamma^t r(i_t, a_t, i_{t+1}) \mid i_0 = i, a_t = \mu(i_t) \right]
      - Bellman's Equation:
            V^*(i) = \max_{a \in A} E\left[ r(i, a, j) + \gamma V^*(j) \mid j = p(i, a) \right]
      - Applying the DP operator T to a value estimate:
            (TV)(i) = \max_{a \in A} E\left[ r(i, a, j) + \gamma V(j) \mid j = p(i, a) \right]
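For a small finite MDP with known transition probabilities, the operator T can be written out directly; a minimal value-iteration sketch over a toy MDP of my own (the slide's p(i, a) is generalised to a transition probability matrix):

    import numpy as np

    # Toy MDP: N states, 2 actions, transition probabilities P[a, i, j],
    # rewards r[a, i, j] and discount factor gamma.
    N, gamma = 4, 0.9
    rng = np.random.default_rng(2)
    P = rng.random((2, N, N))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((2, N, N))

    def T(V):
        """DP operator: (TV)(i) = max_a E[ r(i, a, j) + gamma * V(j) ]."""
        Q = np.einsum('aij,aij->ai', P, r) + gamma * np.einsum('aij,j->ai', P, V)
        return Q.max(axis=0)

    # Value iteration: repeated application of T converges to V* = TV*.
    V = np.zeros(N)
    for _ in range(200):
        V = T(V)
    print(V)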
  16. Dynamic Programming
      - T is a contraction w.r.t. the maximum norm, so repeated application converges to the unique fixed point V* = TV*
      - With function approximation, the approximation is applied after every update:
            \tilde{V}_{t+1} = f(T\tilde{V}_t)
      - This might not converge for more complex approximation architectures
      - Is LCS function approximation a non-expansion w.r.t. the maximum norm?
  17. Dynamic Programming: LCS Value Iteration
      - Assume a constant population of averaging classifiers, φ(i) = (1), ∀i ∈ S
      - Each classifier approximates \tilde{V}_{t+1} = T\tilde{V}_t
      - Based on minimising the MSE:
            \sum_{i \in S_k} \big( \tilde{V}_{t+1}(i) - \tilde{V}_{k,t+1}(i) \big)^2
              = \| \tilde{V}_{t+1} - \tilde{V}_{k,t+1} \|^2_{I_{S_k}}
              = \| T\tilde{V}_t - \tilde{V}_{k,t+1} \|^2_{I_{S_k}}
  18. Dynamic Programming: LCS Value Iteration (cont'd)
      - The minimum is the orthogonal projection onto the approximation space:
            \tilde{V}_{k,t+1} = \Pi_{I_{S_k}} \tilde{V}_{t+1} = \Pi_{I_{S_k}} T\tilde{V}_t
      - We also need to determine the new mixing weights by evaluating the new approximation error:
            \varepsilon_{k,t+1} = \frac{1}{\mathrm{Tr}(I_{S_k})} \| \tilde{V}_{t+1} - \tilde{V}_{k,t+1} \|^2_{I_{S_k}}
                                = \frac{1}{\mathrm{Tr}(I_{S_k})} \| (I - \Pi_{I_{S_k}}) T\tilde{V}_t \|^2_{I_{S_k}}
      - Hence the only time dependency is on \tilde{V}_t, and thus:
            \tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_{k,t+1} \Pi_{I_{S_k}} T\tilde{V}_t
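Putting the two value-iteration slides together, a hedged sketch of one LCS value-iteration step with averaging classifiers, Ṽ_{t+1} = Σ_k Ψ_{k,t+1} Π_{I_{S_k}} T Ṽ_t, using accuracy-based mixing with ν = 1; the toy MDP and matched sets are invented, and the sketch only illustrates the update rule, not any convergence claim:

    import numpy as np

    N, gamma, nu = 6, 0.9, 1.0
    rng = np.random.default_rng(3)
    P = rng.random((2, N, N))
    P /= P.sum(axis=2, keepdims=True)   # toy transition probabilities
    r = rng.random((2, N, N))           # toy rewards

    def T(V):
        """Greedy DP operator on the toy MDP."""
        Q = np.einsum('aij,aij->ai', P, r) + gamma * np.einsum('aij,j->ai', P, V)
        return Q.max(axis=0)

    # Averaging classifiers: Pi_{I_{S_k}} replaces the matched entries by their mean over S_k.
    matched = [np.arange(0, 3), np.arange(2, 6), np.arange(0, 6)]

    V_tilde = np.zeros(N)
    for _ in range(100):
        target = T(V_tilde)                                   # T V~_t
        V_k, eps = [], []
        for S_k in matched:
            mean_k = target[S_k].mean()                       # Pi_{I_{S_k}} T V~_t (constant over S_k)
            V_k.append(mean_k)
            eps.append(((target[S_k] - mean_k) ** 2).mean())  # new approximation error
        V_new = np.zeros(N)
        for i in range(N):                                    # accuracy-based mixing
            w = np.array([1.0 / (e + 1e-12) ** nu if i in S_k else 0.0
                          for S_k, e in zip(matched, eps)])
            V_new[i] = (w / w.sum()) @ np.array(V_k)
        V_tilde = V_new
    print(V_tilde)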
  19. Dynamic Programming: LCS Asynchronous Value Iteration
      - Only update V_k for the matched state, i.e. minimise
            \sum_{m=0}^{t} I_{S_k}(i_m) \big( (T\tilde{V}_m)(i_m) - \omega_{k,t}' \phi(i_m) \big)^2
      - Thus the approximation costs are weighted by the state distribution:
            \sum_{i \in S_k} \pi(i) \big( (T\tilde{V})(i) - \omega_k' \phi(i) \big)^2 = \| T\tilde{V} - \Phi\omega_k \|^2_{D_k}
      - Therefore
            \tilde{V}_{k,t+1} = \Pi_{D_k} T\tilde{V}_t   and   \tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_{k,t+1} \Pi_{D_k} T\tilde{V}_t
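The only change in the asynchronous, distribution-weighted form is that the projection becomes π-weighted; a small sketch of Π_{D_k} for an averaging classifier, where it reduces to the π-weighted mean of the target over S_k (values invented):

    import numpy as np

    N = 6
    pi = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])   # state distribution pi
    S_k = np.arange(2, 6)                                  # states matched by classifier k
    target = np.linspace(0.0, 1.0, N)                      # stands in for (T V~)(i)

    # For an averaging classifier, Pi_{D_k} projects onto constant functions under the
    # D_k = I_{S_k} D inner product, i.e. it takes the pi-weighted mean over S_k.
    weighted_mean = np.sum(pi[S_k] * target[S_k]) / np.sum(pi[S_k])
    print(weighted_mean)

    # Compare with the unweighted projection Pi_{I_{S_k}} of plain LCS value iteration.
    print(target[S_k].mean())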
  20. Dynamic Programming: Also Derived
      - LCS Q-Learning
        - LMS implementation
        - Kalman Filter implementation
        - Demonstrate that the weight update uses the normalised form of local gradient descent
      - Model-based Policy Iteration:
            \tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_k \Pi_{D_k} T_\mu \tilde{V}_t
      - Step-wise Policy Iteration
      - TD(λ) is not possible in LCS due to the re-evaluation of mixing weights
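For the model-based policy iteration form above, the only difference from the value-iteration sketches is that the greedy operator T is replaced by the fixed-policy operator T_μ; a minimal sketch of T_μ on a toy MDP and a hand-picked policy (both invented):

    import numpy as np

    N, gamma = 4, 0.9
    rng = np.random.default_rng(5)
    P = rng.random((2, N, N))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((2, N, N))
    mu = np.array([0, 1, 1, 0])   # a fixed policy mu : S -> A

    def T_mu(V):
        """(T_mu V)(i) = E[ r(i, mu(i), j) + gamma * V(j) ] under the fixed policy mu."""
        Pm = P[mu, np.arange(N)]                       # transition rows under mu, shape (N, N)
        rm = np.einsum('ij,ij->i', Pm, r[mu, np.arange(N)])
        return rm + gamma * Pm @ V

    V = np.zeros(N)
    for _ in range(200):   # policy evaluation by repeated application of T_mu
        V = T_mu(V)
    print(V)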
  21. Dynamic Programming
      - Is LCS function approximation a non-expansion w.r.t. the maximum norm? We derive that:
        - Constant mixing: yes
        - Arbitrary mixing: not necessarily
        - Accuracy-based mixing: a non-expansion (dependent upon a conjecture)
      - Is LCS function approximation a non-expansion w.r.t. a weighted norm?
        - We know T_μ is a contraction
        - We show Π_{D_k} is a non-expansion
        - We demonstrate:
          - Single classifier: convergence to a fixed point
          - Disjoint classifiers: convergence to a fixed point
          - Arbitrary mixing: the spectral radius can be ≥ 1 even for some fixed weightings
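A numerical spot-check (not a proof) of the constant-mixing case: with averaging classifiers and fixed mixing weights, each output entry is a convex combination of input entries, so the approximation never expands maximum-norm distances. Classifiers and weights below are invented:

    import numpy as np

    rng = np.random.default_rng(6)
    N = 8
    matched = [np.arange(0, 4), np.arange(3, 8), np.arange(0, 8)]

    # Fixed (time-invariant) mixing weights: equal split among the classifiers matching each state.
    match = np.column_stack([np.isin(np.arange(N), S_k) for S_k in matched]).astype(float)
    psi = match / match.sum(axis=1, keepdims=True)

    def approx(V):
        """Averaging-classifier approximation with fixed mixing weights."""
        means = np.array([V[S_k].mean() for S_k in matched])
        return psi @ means

    # Spot-check the max-norm non-expansion property on random pairs of value vectors.
    for _ in range(1000):
        V1, V2 = rng.normal(size=N), rng.normal(size=N)
        assert np.abs(approx(V1) - approx(V2)).max() <= np.abs(V1 - V2).max() + 1e-12
    print("non-expansion held on all sampled pairs")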
  22. Classifier Replacement
      - How can we define the optimal population formally?
        - What do we want to reach?
        - How does the global fitness of one classifier differ from its local fitness (i.e. accuracy)?
        - What is the quality of a population?
      - Methodology:
        - Examine methods to define the optimal population
        - Relate to generalised hill-climbing
        - Map to other search criteria
  23. Future Work
      - Formalising and analysing classifier replacement
      - Interaction of replacement with function approximation
      - Interaction of replacement with DP
      - Handling stochastic environments
      - Error bounds and rates of convergence
      - …
      Big Answer: LCS is an integrative framework … but we need the framework definition!