Direct Policy Search (in short)

+ discussions of applications to stock problems

- 1. DIRECT POLICY SEARCH
  0. What is Direct Policy Search?
  1. Direct Policy Search: Parametric Policies for Financial Applications
  2. Parametric Bellman Values for Stock Problems
  3. Direct Policy Search: Optimization Tools
- 2. First, you need to know what direct policy search (DPS) is. Principle of DPS: (1) define a parametric policy pi with parameters t1,...,tk; (2) maximize (t1,...,tk) → average reward obtained when applying policy pi(t1,...,tk) to the problem.
  ==> You must define pi.
  ==> You must choose a noisy optimization algorithm.
  ==> There is a pi by default (an actor neural network), but it's only a default solution (overload it). A toy sketch of the whole loop follows.
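A minimal sketch of the DPS loop under toy assumptions: the one-stock simulator, the two-parameter threshold policy, and the (1+1) evolution strategy below are illustrative stand-ins, not the framework's actual components.

```cpp
// Minimal DPS sketch (illustrative; not the actual framework API).
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

std::mt19937 rng(42);

// pi(t1, t2, state): buy t2 units whenever the stock is below threshold t1.
double policy(const std::vector<double>& t, double stock) {
    return (stock < t[0]) ? t[1] : 0.0;
}

// One noisy episode of a toy stock problem (stands in for the real simulator).
double simulate(const std::vector<double>& t) {
    std::normal_distribution<double> demandNoise(5.0, 2.0);
    double stock = 10.0, reward = 0.0;
    for (int step = 0; step < 50; ++step) {
        double buy = policy(t, stock);
        stock += buy;
        double sold = std::min(stock, std::max(0.0, demandNoise(rng)));
        stock -= sold;
        reward += 1.0 * sold - 0.6 * buy;   // sales revenue minus purchase cost
    }
    return reward;
}

// The DPS objective: average reward of pi(t1,...,tk) over n episodes.
double averageReward(const std::vector<double>& t, int n = 100) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += simulate(t);
    return sum / n;
}

int main() {
    // (1+1)-ES on (t1, t2): mutate; keep the mutant if it scores better.
    std::vector<double> t = {10.0, 5.0};
    double best = averageReward(t), sigma = 1.0;
    std::normal_distribution<double> gauss(0.0, 1.0);
    for (int iter = 0; iter < 200; ++iter) {
        std::vector<double> u = t;
        for (double& x : u) x += sigma * gauss(rng);
        double val = averageReward(u);
        if (val > best) { t = u; best = val; sigma *= 1.5; }
        else            { sigma *= 0.9; }
    }
    std::cout << "t1=" << t[0] << " t2=" << t[1] << " avg reward=" << best << "\n";
    return 0;
}
```

Any noisy optimizer could replace the (1+1)-ES here; the essential point is that the objective is an average of stochastic episode rewards.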
- 3. Strengths of DPS:
  - Good warm start: if I have a solution for problem A and I switch to a problem B close to A, then I quickly get good results.
  - Benefits from expert knowledge on the structure.
  - No constraint on the structure of the objective function.
  - Anytime (i.e. not that bad in restricted time).
  Drawbacks:
  - Needs structured direct policy search.
  - Not directly applicable to partial observation.
- 4. Virtual method to overload:
  virtual MashDecision computeDecision(MashState& state, const Vector<double>& params)
  ==> "params" = t1,...,tk
  ==> returns the decision pi(t1,...,tk, state)
  Does it make sense? Overload this function, and DPS is ready to work. Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
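For concreteness, a hypothetical overload for a one-stock problem; MashState, MashDecision, and Vector are assumed stand-ins, since their real definitions do not appear in the slides.

```cpp
// Hypothetical overload for a one-stock problem; the threshold rule is
// just one possible parametric policy pi(t1, t2, state).
#include <vector>

template <typename T> using Vector = std::vector<T>;    // assumed alias

struct MashState    { double stockLevel; };             // assumed fields
struct MashDecision { double quantityToBuy; };          // assumed fields

struct ThresholdPolicy {
    // params[0] = stock threshold t1, params[1] = purchase amount t2.
    virtual MashDecision computeDecision(MashState& state,
                                         const Vector<double>& params) {
        MashDecision d;
        d.quantityToBuy = (state.stockLevel < params[0]) ? params[1] : 0.0;
        return d;
    }
    virtual ~ThresholdPolicy() = default;
};
```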
- 5. Direct Policy Search: Parametric Policies for Financial Applications
- 6. Bengio et al.'s papers on DPS for financial applications: stocks (various assets) + cash.
  - Can be applied on data sets (no simulator, no elasticity model), because the policy has no impact on prices.
  - decision = tradingUnit(A, prevision(B, data)), where:
    - tradingUnit is designed by human experts;
    - prevision is a neural network whose outputs are chosen by human experts;
    - A and B are parameters (22 params in the first paper; ~800 in the other paper, with reduced weight sharing; much bigger DPS exist, e.g. Sigaud et al., 27,000 parameters).
  Then (if I understand correctly):
  - B optimized by LMS (prevision criterion) ==> poor results: little correlation between the LMS criterion and financial performance.
  - A and B optimized on the expected return (by DPS) ==> much better. NB: noisy optimization. A toy sketch of the architecture follows.
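A toy sketch of that two-stage architecture; the one-layer prevision net and the threshold trading rule below are assumptions, both stages being far larger in the papers.

```cpp
// Sketch of decision = tradingUnit(A, prevision(B, data)).
#include <cmath>
#include <iostream>
#include <vector>

// prevision(B, data): forecast from market data; B = network weights + bias.
double prevision(const std::vector<double>& B, const std::vector<double>& data) {
    double s = B.back();                                  // bias term
    for (size_t i = 0; i < data.size(); ++i) s += B[i] * data[i];
    return std::tanh(s);                                  // forecast signal
}

// tradingUnit(A, forecast): expert-designed rule; here, go long A[1] units
// when the forecast exceeds threshold A[0], else stay flat.
double tradingUnit(const std::vector<double>& A, double forecast) {
    return (forecast > A[0]) ? A[1] : 0.0;
}

double decision(const std::vector<double>& A, const std::vector<double>& B,
                const std::vector<double>& data) {
    return tradingUnit(A, prevision(B, data));
}

int main() {
    std::vector<double> A = {0.0, 1.0};          // threshold, position size
    std::vector<double> B = {0.3, -0.2, 0.1};    // 2 weights + bias
    std::vector<double> data = {1.2, -0.4};      // toy market features
    std::cout << "decision = " << decision(A, B, data) << "\n";
}
```

Under DPS, the concatenation of A and B plays the role of the parameter vector t1,...,tk, and the expected return is optimized directly.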
- 7. An alternate solution: parametric Bellman values for Stock Problems
- 8. What is a Bellman function?
  V(s): expected future benefit if playing optimally from state s.
  V(s) is useful for playing optimally.
- 9. Rule for an optimal decision:
  d(s) = argmax_d [ V(s') + r(s,d) ], where s' = nextState(s,d)
  - d(s): optimal decision in state s
  - V(s): Bellman value in state s
  - r(s,d): reward associated with decision d in state s
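As a sketch, the rule is an exhaustive argmax over a finite candidate set; the toy value, reward, and transition functions below are assumptions, not the slides' model.

```cpp
// d(s) = argmax_d [ V(nextState(s,d)) + r(s,d) ] over candidate decisions.
#include <iostream>
#include <limits>
#include <vector>

double V(double stock) { return 10.0 * stock - 0.1 * stock * stock; } // toy Bellman value
double r(double /*stock*/, double buy) { return -0.6 * buy; }         // purchase cost
double nextState(double stock, double buy) { return stock + buy; }

double optimalDecision(double stock, const std::vector<double>& candidates) {
    double bestD = candidates.front();
    double bestVal = -std::numeric_limits<double>::infinity();
    for (double d : candidates) {
        double val = V(nextState(stock, d)) + r(stock, d);
        if (val > bestVal) { bestVal = val; bestD = d; }
    }
    return bestD;
}

int main() {
    std::cout << "d(5) = " << optimalDecision(5.0, {0, 1, 2, 5, 10, 20}) << "\n";
}
```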
- 10. Remark 1: V(s) known up to an additive constant is enough.
  Remark 2: dV(s)/ds_i is the price of stock i.
  Example with one stock, soon.
- 11. Q-rule for an optimal decision:
  d(s) = argmax_d Q(s,d)
  - d(s): optimal decision in state s
  - Q(s,d): optimal future reward if decision d is taken in state s
  ==> approximate Q instead of V
  ==> we don't need r(s,d) nor newState(s,d)
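The same argmax with a learned Q, to make the contrast concrete: neither r(s,d) nor newState(s,d) appears. The candidate set and the toy Q are assumptions.

```cpp
// Q-based decision: Q is any parametric approximation (assumption).
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

double optimalDecisionQ(double stock, const std::vector<double>& candidates,
                        const std::function<double(double, double)>& Q) {
    return *std::max_element(candidates.begin(), candidates.end(),
        [&](double a, double b) { return Q(stock, a) < Q(stock, b); });
}

int main() {
    auto Q = [](double s, double d) { return -(s + d - 12.0) * (s + d - 12.0); }; // toy Q
    std::cout << optimalDecisionQ(5.0, {0, 1, 2, 5, 10, 20}, Q) << "\n";  // prints 5
}
```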
- 12. [Figure: V(stock) (in euros) as a function of stock (in kWh); slope = marginal price (euros/kWh). At low stock: "I need a lot of stock! I accept to pay a lot." At high stock: "I have enough stock; I pay only if it's cheap."]
- 13. Examples of parametric value functions (a sketch of the piecewise linear case follows below).
  For one stock:
  - very simple: constant price
  - piecewise linear (can ensure convexity)
  - "tanh" function
  - neural network, SVM, sum of Gaussians...
  For several stocks:
  - each stock separately
  - 2-dimensional: V(s1,s2,s3) = v(s1,S) + v(s2,S) + v(s3,S), where S = a1.s1 + a2.s2 + a3.s3
  - neural network, SVM, sum of Gaussians...
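A sketch of the piecewise linear case for one stock. The parameter layout and sample values are illustrative; the breakpoints and slopes are exactly the coefficients DPS would tune, and keeping the slopes monotonic enforces convexity (or concavity: decreasing slopes reproduce the shape of the curve on slide 12) by construction.

```cpp
// Piecewise linear V: breakpoints plus one slope (marginal price) per segment.
#include <algorithm>
#include <iostream>
#include <vector>

// breaks = {b0,...,bn}; slopes = {m0,...,mn}; slope m_i applies on
// [b_i, b_{i+1}), and the last slope extends beyond b_n.
double piecewiseV(double stock, const std::vector<double>& breaks,
                  const std::vector<double>& slopes) {
    double v = 0.0;
    for (size_t i = 0; i < slopes.size(); ++i) {
        double lo = breaks[i];
        double hi = (i + 1 < breaks.size()) ? breaks[i + 1] : stock;
        if (stock <= lo) break;
        v += slopes[i] * (std::min(stock, hi) - lo);
    }
    return v;
}

int main() {
    std::vector<double> breaks = {0.0, 10.0, 20.0};   // kWh
    std::vector<double> slopes = {5.0, 2.0, 0.5};     // euros/kWh, decreasing
    std::cout << "V(15 kWh) = " << piecewiseV(15.0, breaks, slopes) << " euros\n";
}
```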
- 14. How to choose the coefficients?
  - dynamic programming: robust, but slow in high dimension
  - direct policy search:
    - initialize the coefficients from expert advice,
    - or use supervised machine learning to approximate expert advice,
    ==> and then optimize.
- 15. Conclusions:
  V: very convenient representation of a policy: we can view prices.
  Q: some advantages (model-free).
  Yet both are less readable than direct rules, and expensive: we need one optimization to make the decision at each time step of a simulation.
  ==> But this optimization can be a simple sort (as a first approximation).
  Simpler? Adrien has a parametric strategy for stocks.
  ==> We should see how to generalize it: the transformation "constants → parameters" ==> DPS.
- 16. Questions (strategic decisions for the DPS):
  - start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
  - or another strategy?
  - or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
  - or a parametric Q function? (more generic, unusual but appealing, but neglects the existing knowledge of r(s,d) and newState(s,d))
  Further work:
  - finish the validation of Adrien's policy on stocks (better than random as a policy; better than random inside UCT Monte Carlo)
  - generalize? variants?
  - introduce it into DPS, compare to the baseline (neural net)
  - introduce the DPS result into MCTS
- 18. Direct Policy Search: Optimization Tools & Optimization Tricks
- 19. Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
  ==> more or less supposed to be robust to local minima
  ==> no gradient needed
  ==> robust to a noisy objective function
  ==> weak in high dimension (but: see locality, next slide)
  Hopefully:
  - good initialization: nearly convex
  - random seeds: no noise (see the sketch below)
  ==> NewUOA is my favorite choice:
  - no gradient
  - can "really" work in high dimension
  - update rule surprisingly fast
  - people who try to show that their algorithm is better than NewUOA suffer a lot in the noise-free case
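A sketch of the "random seeds: no noise" trick: reuse the same fixed scenarios (common random numbers) for every evaluation, so the objective becomes deterministic and a noise-free optimizer such as NewUOA applies. simulateOnScenario() below is a hypothetical stand-in for one episode.

```cpp
// Common random numbers: a scenario is identified with a fixed RNG seed.
#include <random>
#include <vector>

double simulateOnScenario(const std::vector<double>& params, unsigned seed) {
    std::mt19937 rng(seed);                       // the scenario IS the seed
    std::normal_distribution<double> noise(0.0, 1.0);
    double reward = 0.0;                          // toy episode dynamics
    for (int t = 0; t < 10; ++t) reward += params[0] * noise(rng);
    return reward;
}

// Deterministic objective: the same seeds are reused for every call,
// so two evaluations of the same params return exactly the same value.
double objective(const std::vector<double>& params,
                 const std::vector<unsigned>& seeds) {
    double sum = 0.0;
    for (unsigned s : seeds) sum += simulateOnScenario(params, s);
    return sum / seeds.size();
}

int main() {
    std::vector<unsigned> seeds = {1, 2, 3, 4, 5}; // fixed, representative sample
    std::vector<double> params = {0.5};
    return objective(params, seeds) == objective(params, seeds) ? 0 : 1;
}
```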
- 20. Improvements of optimization algorithms:
  - active learning: when optimizing on scenarios, choose "good" scenarios
    ==> maybe "quasi-randomization"? just choose a representative sample of scenarios
    ==> simple, robust...
  - local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
    ==> difficult to use in NewUOA
- 21. Roadmap:
  - default policy for energy management problems: test, generalize, formalize, simplify...
  - this default policy ==> a parametric policy
  - test in DPS: strategy A
  - interface DPS with NewUOA and/or others (openDP opt?)
  - strategy A tested inside MCTS ==> strategy B
  ==> IMHO, strategy A = a good tool for fast, readable, non-myopic results
  ==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
  - Also, validating the partial observation (sounds good).
