Direct policy search
Direct Policy Search (in short) + discussions of applications to stock problems.

  1. DIRECT POLICY SEARCH
     0. What is Direct Policy Search?
     1. Direct Policy Search: Parametric Policies for Financial Applications
     2. Parametric Bellman values for Stock Problems
     3. Direct Policy Search: Optimization Tools
  2. First, you need to know what direct policy search (DPS) is. Principle of DPS:
     (1) Define a parametric policy Pi with parameters t1, ..., tk.
     (2) Maximize the function (t1, ..., tk) → average reward obtained when applying policy Pi(t1, ..., tk) to the problem.
     ==> You must define Pi.
     ==> You must choose a noisy optimization algorithm.
     ==> There is a Pi by default (an actor neural network), but it's only a default solution (overload it).
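A minimal sketch of this loop on a hypothetical one-stock toy problem. The policy shape, the simulator, and the plain random search (standing in for a real noisy optimizer such as an Evolution Strategy or NewUoa) are all illustrative assumptions, not the framework's defaults.

```cpp
// Minimal DPS sketch (illustrative): a parametric policy pi(t1..tk, state),
// scored by its average reward over simulated scenarios, and optimized
// by plain random search standing in for a real noisy optimizer.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical 1-stock toy problem: state = stock level, decision = quantity bought.
double simulateEpisode(const std::vector<double>& params, std::mt19937& rng) {
  std::uniform_real_distribution<double> price(0.5, 1.5);   // random market price
  double stock = 0.0, reward = 0.0;
  for (int t = 0; t < 24; ++t) {
    double p = price(rng);
    // Parametric policy pi(t1, t2, state): buy when the price is below a threshold,
    // up to a target stock level. params = {priceThreshold, targetStock}.
    double buy = (p < params[0]) ? std::max(0.0, params[1] - stock) : 0.0;
    stock += buy;
    reward -= p * buy;          // pay for what we buy
    double served = std::min(stock, 1.0);   // fixed hourly demand of 1 (toy assumption)
    stock -= served;
    reward += 2.0 * served;     // value of served demand
  }
  return reward;
}

// Noisy objective: average reward of pi(params) over n scenarios.
double averageReward(const std::vector<double>& params, int n, unsigned seed) {
  std::mt19937 rng(seed);       // fixed seed ==> same scenarios for every candidate
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += simulateEpisode(params, rng);
  return sum / n;
}

int main() {
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> u(0.0, 3.0);
  std::vector<double> best = {1.0, 1.0};
  double bestValue = averageReward(best, 100, 7);
  for (int iter = 0; iter < 1000; ++iter) {       // random search over (t1, t2)
    std::vector<double> cand = {u(rng), u(rng)};
    double v = averageReward(cand, 100, 7);
    if (v > bestValue) { bestValue = v; best = cand; }
  }
  std::printf("best params: %.3f %.3f, avg reward %.3f\n", best[0], best[1], bestValue);
  return 0;
}
```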
  3. Strengths of DPS:
     - Good warm start: if I have a solution for problem A, and if I switch to a problem B close to A, then I quickly get good results.
     - Benefits from expert knowledge on the structure.
     - No constraint on the structure of the objective function.
     - Anytime (i.e. not that bad in restricted time).
     Drawbacks:
     - Needs structured direct policy search.
     - Not directly applicable to partial observation.
  4. virtual MashDecision computeDecision(MashState& state, const Vector<double> params)
     ==> "params" = t1, ..., tk
     ==> returns the decision Pi(t1, ..., tk, state)
     Does it make sense? Overload this function, and DPS is ready to work.
     Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
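A hedged sketch of what overloading this hook might look like. Only the type and function names come from the slide; the members used below (stock, price, quantity) and the std::vector stand-in for the framework's Vector<double> are invented for illustration.

```cpp
// Hypothetical sketch of plugging a parametric policy into the DPS hook.
// MashState / MashDecision are stand-ins for the framework types named on the
// slide; their members here are illustrative assumptions only.
#include <algorithm>
#include <cstdio>
#include <vector>

struct MashState    { double stock; double price; };
struct MashDecision { double quantity; };

struct MyPolicy {
  // params = t1, ..., tk, provided by the noisy optimizer.
  // Overloading/overriding this function is all DPS needs from the user.
  virtual MashDecision computeDecision(MashState& state, const std::vector<double>& params) {
    MashDecision d;
    // Example rule: buy up to params[1] units of stock when the price is below params[0].
    d.quantity = (state.price < params[0]) ? std::max(0.0, params[1] - state.stock) : 0.0;
    return d;
  }
  virtual ~MyPolicy() {}
};

int main() {
  MyPolicy policy;
  MashState s{0.5, 0.8};                       // stock = 0.5, price = 0.8
  std::vector<double> params = {1.0, 2.0};     // t1 = price threshold, t2 = target stock
  MashDecision d = policy.computeDecision(s, params);
  std::printf("decision: buy %.2f\n", d.quantity);
  return 0;
}
```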
  5. Direct Policy Search: Parametric Policies for Financial Applications
  6. Bengio et al. papers on DPS for financial applications: stocks (various assets) + cash.
     - Can be applied on data sets (no simulator, no elasticity model), because the policy has no impact on prices.
     - decision = tradingUnit(A, prevision(B, data)), where:
       - tradingUnit is designed by human experts,
       - the prevision's outputs are chosen by human experts,
       - prevision is a neural network,
       - A and B are parameters (22 params in the first paper; with reduced weight sharing, ~800 parameters in the other paper; there exist much bigger DPS, e.g. Sigaud et al., 27,000).
     Then (if I understand correctly):
     - B is optimized by LMS (prevision criterion) ==> poor results, little correlation between LMS and financial performance.
     - A and B are optimized on the expected return (by DPS) ==> much better; nb: noisy optimization.
  7. An alternate solution: parametric Bellman values for Stock Problems
  8. What is a Bellman function? V(s): expected benefit, in the future, if playing optimally from state s. V(s) is useful for playing optimally.
  9. Rule for an optimal decision:
     d(s) = argmax_d [ r(s,d) + V(s') ],  with s' = nextState(s,d)
     - d(s): optimal decision in state s
     - V(s'): Bellman value of the next state s'
     - r(s,d): reward associated with decision d in state s
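A small sketch of applying this rule with a discretized decision set. The reward, transition, and tanh-shaped V below are illustrative placeholders, not the project's actual model.

```cpp
// Sketch of the decision rule d(s) = argmax_d [ r(s,d) + V(nextState(s,d)) ]
// over a discretized set of candidate decisions.
#include <cmath>
#include <cstdio>

double V(double stock)                     { return 10.0 * std::tanh(stock / 5.0); } // parametric Bellman value
double r(double stock, double buy)         { return -1.0 * buy; }                    // immediate reward (cost of buying)
double nextState(double stock, double buy) { return stock + buy - 1.0; }             // toy transition: demand of 1

double bestDecision(double stock) {
  double bestD = 0.0, bestValue = -1e30;
  for (double d = 0.0; d <= 5.0; d += 0.1) {           // discretized decision set
    double value = r(stock, d) + V(nextState(stock, d));
    if (value > bestValue) { bestValue = value; bestD = d; }
  }
  return bestD;
}

int main() {
  std::printf("decision at stock=2: %.2f\n", bestDecision(2.0));
  return 0;
}
```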
  10. Remark 1: knowing V(s) up to an additive constant is enough.
      Remark 2: dV(s)/d(si) is the price of stock i.
      Example with one stock, soon.
  11. Q-rule for an optimal decision:
      d(s) = argmax_d Q(s,d)
      - d(s): optimal decision in state s
      - Q(s,d): optimal future reward if decision d is taken in state s
      ==> approximate Q instead of V
      ==> we don't need r(s,d) nor nextState(s,d)
  12. [Figure: V(stock), in euros, as a function of stock (in kWh). Slope = marginal price (euros/kWh). High stock: "I have enough stock; I pay only if it's cheap." Low stock: "I need a lot of stock! I accept to pay a lot."]
  13. Examples:
      For one stock:
      - very simple: constant price
      - piecewise linear (can ensure convexity)
      - "tanh" function
      - neural network, SVM, sum of Gaussians...
      For several stocks:
      - each stock separately
      - 2-dimensional: V(s1,s2,s3) = v(s1,S) + v(s2,S) + v(s3,S), where S = a1·s1 + a2·s2 + a3·s3
      - neural network, SVM, sum of Gaussians...
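As a concrete instance of the "tanh" option for one stock, here is one possible parameterization; its derivative is read off as the marginal price, as in the remark and figure above. The form and coefficient names are an illustrative guess, not the project's actual choice.

```cpp
// One possible "tanh" parameterization of V for a single stock.
// V(s) = a * tanh((s - b) / c) + d*s ;  dV/ds is the marginal price (euros/kWh).
#include <cmath>
#include <cstdio>

struct TanhValue {
  double a, b, c, d;                              // coefficients to be tuned by DPS
  double value(double s) const { return a * std::tanh((s - b) / c) + d * s; }
  double marginalPrice(double s) const {          // dV/ds
    double t = std::tanh((s - b) / c);
    return a * (1.0 - t * t) / c + d;
  }
};

int main() {
  TanhValue v{10.0, 5.0, 2.0, 0.1};
  for (double s = 0.0; s <= 10.0; s += 2.0)
    std::printf("stock=%4.1f kWh  V=%7.3f euros  price=%6.3f euros/kWh\n",
                s, v.value(s), v.marginalPrice(s));
  return 0;
}
```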
  14. How to choose the coefficients?
      - dynamic programming: robust, but slow in high dimension
      - direct policy search:
        - initializing the coefficients from expert advice,
        - or: supervised machine learning approximating an expert advice,
        ==> and then optimize
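A sketch of the expert-initialization idea: expert marginal prices at a few stock levels are integrated into initial coefficients of a piecewise-linear V, which DPS then refines. The expert numbers are made up for illustration.

```cpp
// Turn expert advice (marginal prices on a few stock segments) into initial
// coefficients of a piecewise-linear V by integrating the prices.
// V is only needed up to an additive constant, so V(0) = 0 is fine.
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> stocks = {0.0, 2.0, 5.0, 10.0};   // knots (kWh)
  std::vector<double> prices = {3.0, 1.5, 0.5};         // expert price on each segment (euros/kWh)
  std::vector<double> V(stocks.size(), 0.0);            // initial coefficients for DPS
  for (size_t k = 1; k < stocks.size(); ++k)
    V[k] = V[k - 1] + prices[k - 1] * (stocks[k] - stocks[k - 1]);
  for (size_t k = 0; k < stocks.size(); ++k)
    std::printf("V(%.1f kWh) = %.2f euros (initial coefficient)\n", stocks[k], V[k]);
  return 0;
}
```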
  15. Conclusions:
      V: very convenient representation of a policy: we can view prices.
      Q: some advantages (model-free).
      Yet, less readable than direct rules. And expensive: we need one optimization for making the decision, at each time step of a simulation ==> but this optimization can be a simple sort (as a first approximation).
      Simpler? Adrien has a parametric strategy for stocks
      ==> we should see how to generalize it
      ==> transformation "constants → parameters"
      ==> DPS
  16. Questions (strategic decisions for the DPS):
      - start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
      - or another strategy?
      - or a parametric V function, assuming we have r(s,d) and nextState(s,d) (often true)?
      - or a parametric Q function? (more generic, unusual but appealing, but neglects some existing knowledge: r(s,d) and nextState(s,d))
      Further work:
      - finish the validation of Adrien's policy on stocks (better than random as a policy; better than random as a UCT Monte-Carlo)
      - generalize? variants?
      - introduce it into DPS, compare to the baseline (neural net)
      - introduce DPS's result into MCTS
  18. Direct Policy Search: Optimization Tools & Optimization Tricks
  19. - Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
        ==> more or less supposed to be robust to local minima
        ==> no gradient
        ==> robust to noisy objective functions
        ==> weak in high dimension (but: see locality, next slide)
      - Hopefully:
        - good initialization: nearly convex
        - fixed random seeds: no noise
      ==> NewUoa is my favorite choice:
        - no gradient
        - can "really" work in high dimension
        - update rule surprisingly fast
        - people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
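NewUoa itself is a published derivative-free solver and is not reproduced here; as a stand-in for the classical tools above, here is a minimal (1+1) Evolution Strategy combined with the "fixed random seeds" trick (every candidate is scored on the same scenarios, so the optimizer sees a noise-free objective). The toy objective is an illustrative assumption.

```cpp
// Minimal (1+1)-Evolution Strategy with a 1/5th-success-rule step size,
// on a toy noisy objective made deterministic by fixing the scenario seed.
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical noisy objective: reward of policy(params) on one scenario.
double episodeReward(const std::vector<double>& params, std::mt19937& scenario) {
  std::normal_distribution<double> noise(0.0, 0.5);
  double x = params[0] - 1.0, y = params[1] + 0.5;
  return -(x * x + y * y) + noise(scenario);           // toy concave reward + noise
}

double fitness(const std::vector<double>& params) {
  std::mt19937 scenario(123);                           // fixed seed ==> fixed scenarios, no noise
  double sum = 0.0;
  for (int i = 0; i < 50; ++i) sum += episodeReward(params, scenario);
  return sum / 50;
}

int main() {
  std::mt19937 rng(7);
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> parent = {0.0, 0.0};
  double parentFit = fitness(parent), sigma = 1.0;
  for (int iter = 0; iter < 500; ++iter) {
    std::vector<double> child = parent;
    for (double& p : child) p += sigma * gauss(rng);    // Gaussian mutation
    double childFit = fitness(child);
    if (childFit >= parentFit) { parent = child; parentFit = childFit; sigma *= 1.5; }
    else sigma *= 0.87;                                 // shrink step size on failure
  }
  std::printf("best params: %.3f %.3f  fitness %.3f\n", parent[0], parent[1], parentFit);
  return 0;
}
```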
  20. Improvements of optimization algorithms:
      - active learning: when optimizing on scenarios, choose "good" scenarios
        ==> maybe "quasi-randomization"? Just choosing a representative sample of scenarios.
        ==> simple, robust...
      - local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
        ==> difficult to use in NewUoa
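A sketch of the "quasi-randomization" idea: draw the evaluation scenarios from a low-discrepancy (van der Corput / Halton) sequence instead of i.i.d. uniform draws, so the sample covers the scenario space more evenly. The 2-D scenario parameterization (demand level, price level) is an illustrative assumption.

```cpp
// Quasi-random scenario selection: Halton points instead of i.i.d. draws.
#include <cstdio>
#include <vector>

// Van der Corput sequence (radical inverse) in the given base.
double vanDerCorput(int n, int base) {
  double q = 0.0, bk = 1.0 / base;
  while (n > 0) { q += (n % base) * bk; n /= base; bk /= base; }
  return q;
}

struct Scenario { double demandLevel; double priceLevel; };

// i-th 2-D Halton point (bases 2 and 3), mapped to scenario ranges.
Scenario makeScenario(int i) {
  return { 0.5 + 1.5 * vanDerCorput(i + 1, 2),     // demand in [0.5, 2.0]
           0.5 + 1.0 * vanDerCorput(i + 1, 3) };   // price  in [0.5, 1.5]
}

int main() {
  std::vector<Scenario> scenarios;
  for (int i = 0; i < 8; ++i) scenarios.push_back(makeScenario(i));
  for (const Scenario& s : scenarios)
    std::printf("demand=%.3f price=%.3f\n", s.demandLevel, s.priceLevel);
  return 0;
}
```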
  21. Roadmap:
      - default policy for energy management problems: test, generalize, formalize, simplify...
      - this default policy ==> a parametric policy
      - test in DPS: strategy A
      - interface DPS with NewUoa and/or others (openDP opt?)
      - strategy A: test into MCTS ==> strategy B
      ==> IMHO, strategy A = good tool for fast, readable, non-myopic results
      ==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
      - Also, validating the partial observation (sounds good).