Direct policy search

Direct Policy Search (in short)
+ discussions of applications to stock problems


  1. DIRECT POLICY SEARCH
     0. What is Direct Policy Search?
     1. Direct Policy Search: Parametric Policies for Financial Applications
     2. Parametric Bellman Values for Stock Problems
     3. Direct Policy Search: Optimization Tools
  2. First, you need to know what direct policy search (DPS) is. Principle of DPS:
     (1) Define a parametric policy Pi with parameters t1,...,tk.
     (2) Maximize the mapping (t1,...,tk) → average reward when applying policy Pi(t1,...,tk) on the problem.
     ==> You must define Pi.
     ==> You must choose a noisy optimization algorithm.
     ==> There is a Pi by default (an actor neural network), but it's only a default solution (overload it).
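A minimal, self-contained C++ sketch of these two ingredients on a toy one-stock problem; the policy shape, the simulator and all constants are illustrative, not taken from the slides:

```cpp
// Sketch of the DPS ingredients: a parametric policy and its noisy objective.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

using Params = std::vector<double>;

// (1) Parametric policy Pi(t1,...,tk, state): buy t[1] units when the stock is below t[0].
double policy(const Params& t, double stock) {
    return (stock < t[0]) ? t[1] : 0.0;
}

// (2) Noisy objective: average reward of Pi(t, .) over nbSim simulated episodes.
double averageReward(const Params& t, int nbSim, std::mt19937& rng) {
    std::uniform_real_distribution<double> price(0.5, 1.5);   // random purchase price
    std::uniform_real_distribution<double> demand(0.0, 1.0);  // random demand
    double total = 0.0;
    for (int sim = 0; sim < nbSim; ++sim) {
        double stock = 1.0, reward = 0.0;
        for (int step = 0; step < 50; ++step) {
            double bought = std::max(0.0, policy(t, stock));
            reward -= bought * price(rng);                // pay for what we buy
            stock += bought;
            double served = std::min(stock, demand(rng)); // serve as much demand as possible
            reward += 2.0 * served;                       // fixed selling price
            stock -= served;
        }
        total += reward;
    }
    return total / nbSim;
}

int main() {
    std::mt19937 rng(1);
    Params t = {0.8, 0.5};  // example parameters; DPS would tune these by noisy optimization
    std::cout << "average reward: " << averageReward(t, 100, rng) << "\n";
}
```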
  3. Strengths of DPS:
     - Good warm start: if I have a solution for problem A, and if I switch to a problem B close to A, then I quickly get good results.
     - Benefits from expert knowledge on the structure.
     - No constraint on the structure of the objective function.
     - Anytime (i.e. not that bad in restricted time).
     Drawbacks:
     - Needs a structured direct policy search.
     - Not directly applicable to partial observation.
  4. virtual MashDecision computeDecision(MashState& state, const Vector<double> params)
     ==> "params" = t1,...,tk
     ==> returns the decision Pi(t1,...,tk, state)
     Does it make sense? Overload this function, and DPS is ready to work.
     Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
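A hypothetical overload, just to illustrate the mechanism. Only the computeDecision signature comes from the slide; the stub types below (a MashState with a stock field, a MashDecision with a buy field, the Vector alias) are placeholders and not the real MASH API:

```cpp
// Placeholder stand-ins for the real MASH classes (illustration only).
#include <vector>
template <typename T> using Vector = std::vector<T>;    // assumed alias

struct MashState    { double stock = 0.0; };            // placeholder
struct MashDecision { double buy   = 0.0; };            // placeholder

struct ThresholdPolicy {
    // params = t1,...,tk, tuned by the noisy optimizer:
    // buy params[1] units whenever the stock falls below params[0].
    virtual MashDecision computeDecision(MashState& state,
                                         const Vector<double> params) {
        MashDecision d;
        if (state.stock < params[0]) d.buy = params[1];
        return d;
    }
    virtual ~ThresholdPolicy() = default;
};
```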
  5. Direct Policy Search: Parametric Policies for Financial Applications
  6. Bengio et al.'s papers on DPS for financial applications.
     Stocks (various assets) + cash.
     Can be applied on data sets (no simulator, no elasticity model), because the policy has no impact on prices.
     decision = tradingUnit(A, prevision(B, data)), where:
     - tradingUnit is designed by human experts
     - prevision's outputs are chosen by human experts
     - prevision is a neural network
     - A and B are parameters
     Parameterization:
     - 22 params in the first paper
     - reduced weight sharing in the other paper ==> ~800 parameters
     - there exist much bigger DPS (Sigaud et al., 27 000)
     Then (if I understand correctly):
     - B optimized by LMS (prevision criterion) ==> poor results, little correlation between LMS and financial performance
     - A and B optimized on the expected return (by DPS) ==> much better; nb: noisy optimization
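A small sketch of that decision structure, decision = tradingUnit(A, prevision(B, data)). Only the composition comes from the slide; the function bodies here are placeholder logic:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;

// Forecast module (stand-in for the neural network with parameters B).
Vec prevision(const Vec& B, const Vec& data) {
    Vec out(1, 0.0);
    for (std::size_t i = 0; i < data.size() && i < B.size(); ++i)
        out[0] += B[i] * data[i];
    return out;
}

// Expert-designed trading rule with parameters A (placeholder logic:
// position proportional to the forecast, clipped to [-A[1], A[1]]).
double tradingUnit(const Vec& A, const Vec& forecast) {
    return std::max(-A[1], std::min(A[1], A[0] * forecast[0]));
}

// Structure from the slide: decision = tradingUnit(A, prevision(B, data)).
double decision(const Vec& A, const Vec& B, const Vec& data) {
    return tradingUnit(A, prevision(B, data));
}

int main() {
    Vec A = {1.0, 0.5}, B = {0.2, -0.1}, data = {3.0, 1.0};
    std::cout << "decision: " << decision(A, B, data) << "\n";
}
```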
  7. An alternate solution: parametric Bellman values for stock problems
  8. What is a Bellman function?
     V(s): expected benefit, in the future, if playing optimally from state s.
     V(s) is useful for playing optimally.
  9. Rule for an optimal decision:
     d(s) = argmax_d [ V(s') + r(s,d) ], where s' = nextState(s,d)
     - d(s): optimal decision in state s
     - V(s): Bellman value in state s
     - r(s,d): reward associated with decision d in state s
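A sketch of that rule for a finite set of candidate decisions, assuming user-supplied r, nextState and V; names and types are illustrative:

```cpp
// d(s) = argmax_d [ r(s,d) + V(nextState(s,d)) ] over a finite, non-empty candidate set.
#include <functional>
#include <limits>
#include <vector>

double bestDecision(double s,
                    const std::vector<double>& candidates,
                    const std::function<double(double, double)>& r,          // r(s,d)
                    const std::function<double(double, double)>& nextState,  // nextState(s,d)
                    const std::function<double(double)>& V) {                // Bellman value
    double bestD = candidates.front();
    double bestValue = -std::numeric_limits<double>::infinity();
    for (double d : candidates) {
        double value = r(s, d) + V(nextState(s, d));  // immediate reward + future value
        if (value > bestValue) { bestValue = value; bestD = d; }
    }
    return bestD;
}
```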
  10. Remark 1: V(s) known up to an additive constant is enough.
     Remark 2: dV(s)/ds_i is the price of stock i.
     Example with one stock, soon.
  11. Q-rule for an optimal decision:
     d(s) = argmax_d Q(s,d)
     - d(s): optimal decision in state s
     - Q(s,d): optimal future reward if decision d is taken in state s
     ==> approximate Q instead of V
     ==> we don't need r(s,d) nor newState(s,d)
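The same greedy rule written with a Q-function instead of V; note that no reward model and no transition model appear (again an illustrative sketch):

```cpp
// d(s) = argmax_d Q(s,d): no r(s,d) and no newState(s,d) needed.
#include <functional>
#include <limits>
#include <vector>

double bestDecisionQ(double s,
                     const std::vector<double>& candidates,  // non-empty set of decisions
                     const std::function<double(double, double)>& Q) {  // Q(s,d)
    double bestD = candidates.front();
    double bestValue = -std::numeric_limits<double>::infinity();
    for (double d : candidates) {
        double q = Q(s, d);
        if (q > bestValue) { bestValue = q; bestD = d; }
    }
    return bestD;
}
```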
  12. [Figure: sketch of V(stock), in euros, as a function of the stock, in kWh.
     The slope is the marginal price (euros/kWh). At low stock: "I need a lot of stock!
     I accept to pay a lot." At high stock: "I have enough stock; I pay only if it's cheap."]
  13. Examples:
     For one stock:
     - very simple: constant price
     - piecewise linear (can ensure convexity)
     - "tanh" function
     - neural network, SVM, sum of Gaussians...
     For several stocks:
     - each stock separately
     - 2-dimensional: V(s1,s2,s3) = v(s1,S) + v(s2,S) + v(s3,S), where S = a1.s1 + a2.s2 + a3.s3
     - neural network, SVM, sum of Gaussians...
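A sketch of two of these parametric forms. The exact parameterization below (a shifted tanh, and a tanh whose center moves with the aggregate stock S) is an illustrative choice, not prescribed by the slide:

```cpp
#include <cmath>
#include <vector>

// One stock, "tanh" form: V(s) = p0 * tanh(p1 * (s - p2)).
double tanhValue(double s, const std::vector<double>& p) {
    return p[0] * std::tanh(p[1] * (s - p[2]));
}

// Several stocks, decomposed form: V(s1,...,sn) = sum_i v(s_i, S),
// with aggregate stock S = sum_i a_i * s_i.
// Here v(s_i, S) is a tanh in s_i whose center shifts with S (illustrative).
double v(double si, double S, const std::vector<double>& p) {
    return p[0] * std::tanh(p[1] * (si - p[2] - p[3] * S));
}

double decomposedValue(const std::vector<double>& s,   // stock levels s_i
                       const std::vector<double>& a,   // aggregation weights a_i
                       const std::vector<double>& p) { // shared parameters of v
    double S = 0.0;
    for (std::size_t i = 0; i < s.size(); ++i) S += a[i] * s[i];
    double V = 0.0;
    for (double si : s) V += v(si, S, p);
    return V;
}
```

The parameters p (and the weights a_i) are exactly the coefficients that the next slide proposes to choose by dynamic programming or by DPS.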
  14. How to choose the coefficients?
     - dynamic programming: robust, but slow in high dimension
     - direct policy search:
       - initialize the coefficients from expert advice
       - or: supervised machine learning to approximate an expert advice
       ==> and then optimize
  15. Conclusions:
     V: very convenient representation of a policy: we can view prices.
     Q: some advantages (model-free).
     Yet, less readable than direct rules.
     And expensive: we need one optimization for making the decision, at each time step of a simulation.
     ==> but this optimization can be a simple sort (as a first approximation).
     Simpler? Adrien has a parametric strategy for stocks
     ==> we should see how to generalize it
     ==> transformation "constants → parameters"
     ==> DPS
  16. Questions (strategic decisions for the DPS):
     - start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
     - or another strategy?
     - or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
     - or a parametric Q function? (more generic, unusual but appealing, but neglects some existing knowledge: r(s,d) and newState(s,d))
     Further work:
     - finish the validation of Adrien's policy on stocks (better than random as a policy; better than random as a UCT Monte-Carlo)
     - generalize? variants?
     - introduce it into DPS, compare to the baseline (neural net)
     - introduce DPS's result into MCTS
  18. Direct Policy Search: Optimization Tools & Optimization Tricks
  19. Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
     ==> more or less supposed to be robust to local minima
     ==> no gradient
     ==> robust to noisy objective functions
     ==> weak for high dimension (but: see locality, next slide)
     Hopefully:
     - good initialization: nearly convex
     - random seeds: no noise
     ==> NewUoa is my favorite choice:
     - no gradient
     - can "really" work in high dimension
     - update rule surprisingly fast
     - people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
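For concreteness, a minimal noisy-optimization loop for the DPS parameters: a (1+1)-evolution strategy with resampling to average out the noise. NewUoa, CMA-ES, cross-entropy or PSO would plug into the same place; the objective below is only a stand-in for "average reward of Pi(t1,...,tk)":

```cpp
#include <iostream>
#include <random>
#include <vector>

using Params = std::vector<double>;

// Stand-in noisy objective: a smooth function plus simulation noise.
double noisyReward(const Params& t, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, 0.1);
    return -(t[0] - 1.0) * (t[0] - 1.0) - (t[1] + 0.5) * (t[1] + 0.5) + noise(rng);
}

// Re-evaluate each candidate several times to reduce the noise.
double averagedReward(const Params& t, int resamplings, std::mt19937& rng) {
    double total = 0.0;
    for (int i = 0; i < resamplings; ++i) total += noisyReward(t, rng);
    return total / resamplings;
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Params best = {0.0, 0.0};
    double sigma = 0.5;                                // mutation step-size
    double bestScore = averagedReward(best, 20, rng);
    for (int iter = 0; iter < 200; ++iter) {
        Params cand = best;
        for (double& x : cand) x += sigma * gauss(rng);     // Gaussian mutation
        double score = averagedReward(cand, 20, rng);
        if (score > bestScore) { best = cand; bestScore = score; sigma *= 1.5; }
        else                   { sigma *= 0.9; }            // 1/5th-like step-size rule
    }
    std::cout << "best parameters: " << best[0] << ", " << best[1] << "\n";
}
```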
  20. Improvements of optimization algorithms:
     - active learning: when optimizing on scenarios, choose "good" scenarios
       ==> maybe "quasi-randomization"? Just choosing a representative sample of scenarios.
       ==> simple, robust...
     - local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
       ==> difficult to use in NewUoa
  21. Roadmap:
     - default policy for energy management problems: test, generalize, formalize, simplify...
     - turn this default policy into a parametric policy
     - test in DPS: strategy A
     - interface DPS with NewUoa and/or others (openDP opt?)
     - strategy A: test inside MCTS ==> strategy B
     ==> IMHO, strategy A = a good tool for fast, readable, non-myopic results
     ==> IMHO, strategy B = good for combining A with MCTS's efficiency on short-term combinatorial effects
     - also, validating the partial observation (sounds good)
