# Direct policy search

Direct Policy Search (in short)
+ discussions of applications to stock problems

### Direct policy search

1. DIRECT POLICY SEARCH
   0. What is Direct Policy Search?
   1. Direct Policy Search: Parametric Policies for Financial Applications
   2. Parametric Bellman Values for Stock Problems
   3. Direct Policy Search: Optimization Tools
2. First, you need to know what direct policy search (DPS) is. Principle of DPS:
   1. Define a parametric policy Pi with parameters t1,...,tk.
   2. Maximize (t1,...,tk) → average reward when applying policy Pi(t1,...,tk) to the problem.

   Consequences:
   - ==> You must define Pi.
   - ==> You must choose a noisy optimization algorithm.
   - ==> There is a Pi by default (an actor neural network), but it's only a default solution (overload it).
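The two steps above can be sketched in a few lines. This is only a minimal illustration under assumptions that are not in the slides: the toy stock problem, the one-parameter policy `policy`, and the simple random-search optimizer are all hypothetical stand-ins.

```python
import random

def policy(params, stock):
    """Hypothetical parametric policy Pi: sell a fixed fraction of the stock."""
    fraction = min(1.0, max(0.0, params[0]))
    return fraction * stock

def simulate(params, rng, horizon=10):
    """Toy noisy stock problem (an assumption, not taken from the slides)."""
    stock, total = 10.0, 0.0
    for _ in range(horizon):
        price = 1.0 + rng.random()                  # noisy price in [1, 2)
        sold = policy(params, stock)
        total += price * sold - 0.05 * sold ** 2    # revenue minus a quadratic cost
        stock -= sold
    return total

def average_reward(params, n_runs=50, seed=0):
    """Average reward over n_runs simulations; the fixed seed reuses the same scenarios."""
    rng = random.Random(seed)
    return sum(simulate(params, rng) for _ in range(n_runs)) / n_runs

def direct_policy_search(n_iters=200, seed=1):
    """Random local search on the parameters, keeping the best average reward seen."""
    rng = random.Random(seed)
    best = [0.5]
    best_value = average_reward(best)
    for _ in range(n_iters):
        candidate = [best[0] + rng.gauss(0.0, 0.1)]
        value = average_reward(candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value
```

Calling `direct_policy_search()` returns the best parameter vector found and its average reward. Any noisy optimizer could replace the random search here.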
3. Strengths of DPS:
   - Good warm start: if I have a solution for problem A, and if I switch to a problem B close to A, then I quickly get good results.
   - Benefits from expert knowledge on the structure.
   - No constraint on the structure of the objective function.
   - Anytime (i.e. not that bad in restricted time).

   Drawbacks:
   - Needs structured direct policy search.
   - Not directly applicable to partial observation.
4. `virtual MashDecision computeDecision(MashState & state, const Vector<double> params)`
   - ==> “params” = t1,...,tk
   - ==> returns the decision pi(t1,...,tk,state)

   Does it make sense? Overload this function, and DPS is ready to work. Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
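To make the "overload this function" idea concrete, here is a rough Python analogue of the interface. The original is C++. The class names `Policy` and `ThresholdPolicy`, the float-valued state, and the threshold rule are hypothetical and only illustrate the pattern.

```python
from abc import ABC, abstractmethod
from typing import Sequence

class Policy(ABC):
    """Hypothetical base class mirroring the computeDecision hook."""

    @abstractmethod
    def compute_decision(self, state: float, params: Sequence[float]) -> float:
        """Return the decision pi(t1,...,tk, state) for params = (t1,...,tk)."""

class ThresholdPolicy(Policy):
    """Illustrative override: release a fixed amount when the stock exceeds a threshold."""

    def compute_decision(self, state: float, params: Sequence[float]) -> float:
        threshold, amount = params
        return amount if state > threshold else 0.0
```

For example, `ThresholdPolicy().compute_decision(5.0, [3.0, 1.5])` releases 1.5 units because the stock (5.0) is above the threshold (3.0). The optimizer then only ever sees the parameter vector.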
5. Direct Policy Search: Parametric Policies for Financial Applications
6. Bengio et al. papers on DPS for financial applications. Stocks (various assets) + Cash.
   - Can be applied on data sets (no simulator, no elasticity model) because the policy has no impact on prices.

   decision = tradingUnit(A, prevision(B,data)), where:
   - tradingUnit is designed by human experts (22 params in the first paper)
   - the outputs of prevision (the forecast) are chosen by human experts
   - prevision is a neural network (reduced weight sharing in the other paper ==> ~800 parameters)
   - A and B are parameters

   Then (if I understand correctly):
   - B is optimized by LMS (prevision criterion) ==> poor results, little correlation between LMS and financial performance
   - A and B are optimized on the expected return (by DPS) ==> much better

   Notes:
   - there exist much bigger DPS (Sigaud et al., 27 000)
   - nb: noisy optimization
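The composition decision = tradingUnit(A, prevision(B,data)) can be sketched as follows. Both components are hypothetical simplifications: a linear forecaster stands in for the neural network, and a tanh-saturated position stands in for the expert-designed trading unit. The return is computed on a fixed historical series, which matches the assumption that the policy does not move prices.

```python
import math

def prevision(B, data):
    """Hypothetical forecaster: linear model on recent returns (stand-in for the neural network)."""
    return sum(b * x for b, x in zip(B, data))

def trading_unit(A, forecast):
    """Hypothetical trading unit: position in [-max_position, max_position]."""
    max_position, sensitivity = A
    return max_position * math.tanh(sensitivity * forecast)

def decision(A, B, data):
    """decision = tradingUnit(A, prevision(B, data))."""
    return trading_unit(A, prevision(B, data))

def total_return(A, B, returns, window=2):
    """Return obtained on a historical series; the policy has no impact on the series."""
    total = 0.0
    for t in range(window, len(returns)):
        position = decision(A, B, returns[t - window:t])
        total += position * returns[t]
    return total
```

Optimizing A and B jointly on `total_return` corresponds to the DPS variant. Fitting B alone to a forecasting error would correspond to the LMS variant.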
7. An alternate solution: parametric Bellman values for Stock Problems
8. What is a Bellman function?
   - V(s): expected benefit, in the future, if playing optimally from state s.
   - V(s) is useful for playing optimally.
9. Rule for an optimal decision: $d(s) = \arg\max_d \left[ V(s') + r(s,d) \right]$
   - s' = nextState(s,d)
   - d(s): optimal decision in state s
   - V(s): Bellman value in state s
   - r(s,d): reward associated with decision d in state s
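A minimal sketch of this decision rule for a single stock, assuming a small discrete set of decisions. The transition `next_state`, the reward `reward`, and the quadratic value function `V` are hypothetical choices made only for the example.

```python
def next_state(s, d, capacity=10):
    """Hypothetical transition: d > 0 buys, d < 0 sells, stock stays in [0, capacity]."""
    return min(capacity, max(0, s + d))

def reward(s, d, price=2.0):
    """Hypothetical reward: pay the price for what is bought, earn it for what is sold."""
    traded = next_state(s, d) - s
    return -price * traded

def V(s):
    """Hypothetical concave Bellman value: the marginal value of stock is 10 - s."""
    return 10.0 * s - 0.5 * s * s

def optimal_decision(s, decisions=(-2, -1, 0, 1, 2)):
    """d(s) = argmax over d of V(nextState(s, d)) + r(s, d)."""
    return max(decisions, key=lambda d: V(next_state(s, d)) + reward(s, d))
```

With these choices the rule buys when the stock is low, holds around a stock of 8 (where the marginal value equals the price), and sells when the stock is high.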
10. Remark 1: knowing V(s) up to an additive constant is enough.
    Remark 2: dV(s)/d(si) is the price of stock i.
    Example with one stock, soon.
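Both remarks can be checked numerically. The sketch below estimates dV/d(si) by a central finite difference. The two-stock value function `V` is a hypothetical example, and adding a constant to it leaves the estimated prices unchanged.

```python
def V(stocks):
    """Hypothetical Bellman value for two stocks (concave in each stock)."""
    s1, s2 = stocks
    return 10.0 * s1 - 0.5 * s1 ** 2 + 6.0 * s2 - 0.25 * s2 ** 2

def marginal_price(value, stocks, i, eps=1e-6):
    """Central finite-difference estimate of dV/d(s_i), i.e. the price of stock i."""
    up = list(stocks)
    down = list(stocks)
    up[i] += eps
    down[i] -= eps
    return (value(up) - value(down)) / (2 * eps)

def shifted_V(stocks):
    """Same value function up to an additive constant."""
    return V(stocks) + 1000.0
```

At stocks (4, 2) the analytic prices are 10 - 4 = 6 for the first stock and 6 - 0.5 * 2 = 5 for the second, and `shifted_V` gives the same estimates up to rounding.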
11. Q-rule for an optimal decision: $d(s) = \arg\max_d Q(s,d)$
    - d(s): optimal decision in state s
    - Q(s,d): optimal future reward if decision = d in s
    - ==> approximate Q instead of V
    - ==> we don't need r(s,d) nor newState(s,d)
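For comparison, a sketch of the Q-rule. The parametric form of `Q` (a quadratic penalty around a state-dependent target) and the parameter vector `theta` are hypothetical. The point is only that the decision uses neither a reward model nor a transition model.

```python
def Q(s, d, theta):
    """Hypothetical parametric Q: best decision is near theta[0] + theta[1] * s."""
    target = theta[0] + theta[1] * s
    return -(d - target) ** 2

def q_decision(s, theta, decisions=(-2, -1, 0, 1, 2)):
    """d(s) = argmax over d of Q(s, d); no r(s, d) or newState(s, d) is needed."""
    return max(decisions, key=lambda d: Q(s, d, theta))
```

In a DPS setting, `theta` would be the parameter vector handed to the optimizer.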
12. [Figure: V(stock) (in euros) as a function of the stock (in kWh). Slope = marginal price (euros/kWh). Annotation at low stock: “I need a lot of stock! I accept to pay a lot.” Annotation at high stock: “I have enough stock; I pay only if it's cheap.”]
13. Examples:
    - For one stock:
      - very simple: constant price
      - piecewise linear (can ensure convexity)
      - “tanh” function
      - neural network, SVM, sum of Gaussians...
    - For several stocks:
      - each stock separately
      - 2-dimensional: V(s1,s2,s3) = v(s1,S) + v(s2,S) + v(s3,S), where S = a1.s1 + a2.s2 + a3.s3
      - neural network, SVM, sum of Gaussians...
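A few of these families are easy to write down. The sketch below is only illustrative: the function names, the exact parametrizations, and the convention that V(0) = 0 are assumptions. For the piecewise-linear form, keeping the slopes sorted is one simple way to enforce a shape constraint (decreasing slopes give a concave function, increasing slopes a convex one).

```python
import math

def v_constant_price(s, price):
    """Constant price: V is linear in the stock."""
    return price * s

def v_piecewise_linear(s, breakpoints, slopes):
    """slopes[i] applies from breakpoints[i] up to the next breakpoint (or infinity)."""
    value = 0.0
    for i, slope in enumerate(slopes):
        lo = breakpoints[i]
        hi = breakpoints[i + 1] if i + 1 < len(breakpoints) else float("inf")
        if s <= lo:
            break
        value += slope * (min(s, hi) - lo)
    return value

def v_tanh(s, scale, width):
    """Saturating value: marginal price decreases smoothly as the stock grows."""
    return scale * math.tanh(s / width)

def v_several_stocks(stocks, weights, v):
    """V(s1,...,sn) = sum of v(si, S), with S the weighted sum of all stocks."""
    S = sum(a * s for a, s in zip(weights, stocks))
    return sum(v(s, S) for s in stocks)
```

In each case the numbers passed in (`price`, `slopes`, `scale`, `width`, `weights`) play the role of the coefficients to be tuned.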
14. How to choose coefficients?
    - dynamic programming: robust, but slow in high dimension
    - direct policy search:
      - initializing coefficients from expert advice
      - or: supervised machine learning for approximating an expert advice
      - ==> and then optimize
15. Conclusions:
    - V: very convenient representation of a policy: we can view prices.
    - Q: some advantages (model-free models).
    - Yet, less readable than direct rules.
    - And expensive: we need one optimization for making the decision, for each time step of a simulation.
      - ==> but this optimization can be a simple sort (as a first approximation).

    Simpler? Adrien has a parametric strategy for stocks:
    - ==> we should see how to generalize it
    - ==> transformation “constants → parameters”
    - ==> DPS
16. Questions (strategic decisions for the DPS):
    - start with Adrien's policy, improve it, generalize it, parametrize it? Interface with ARM?
    - or another strategy?
    - or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
    - or a parametric Q function? (more generic, unusual but appealing, but neglects some existing knowledge: r(s,d) and newState(s,d))

    Further work:
    - finish the validation of Adrien's policy on stock (better than random as a policy; better than random as a UCT-Monte-Carlo)
    - generalize? variants?
    - introduce into DPS, compare to the baseline (neural net)
    - introduce DPS's result into MCTS
18. Direct Policy Search: Optimization Tools & Optimization Tricks
19. Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
    - ==> more or less supposed to be robust to local minima
    - ==> no gradient
    - ==> robust to noisy objective function
    - ==> weak for high dimension (but: see locality, next slide)

    Hopefully:
    - good initialization: nearly convex
    - random seeds: no noise

    ==> NewUoa is my favorite choice:
    - no gradient
    - can “really” work in high dimension
    - update rule surprisingly fast
    - people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
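As an illustration of one of the classical tools named above, here is a minimal cross-entropy method with an independent Gaussian per coordinate. It uses no gradient and only ranks noisy evaluations. The population sizes, the standard-deviation floor, and the noisy test objective `make_noisy_sphere` are arbitrary assumptions, not settings from the slides.

```python
import random
import statistics

def make_noisy_sphere(noise_std=0.05, seed=1):
    """Hypothetical noisy objective with its maximum at (1, ..., 1)."""
    noise = random.Random(seed)

    def objective(x):
        return -sum((xi - 1.0) ** 2 for xi in x) + noise.gauss(0.0, noise_std)

    return objective

def cross_entropy_maximize(objective, dim, n_iters=30, pop=40, elite=8, seed=0):
    """Sample a population, keep the elite, refit the sampling distribution, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(n_iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        samples.sort(key=objective, reverse=True)    # one noisy evaluation per sample
        best = samples[:elite]
        for j in range(dim):
            column = [x[j] for x in best]
            mean[j] = statistics.mean(column)
            std[j] = statistics.pstdev(column) + 1e-3    # small floor keeps sampling alive
    return mean
```

For example, `cross_entropy_maximize(make_noisy_sphere(), dim=2)` should return a point near (1, 1). The per-coordinate refit is also why such methods tend to need large populations in high dimension.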
20. Improvements of optimization algorithms:
    - active learning: when optimizing on scenarios, choose “good” scenarios
      - ==> maybe “quasi-randomization”? Just choosing a representative sample of scenarios.
      - ==> simple, robust...
    - local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
      - ==> difficult to use in NewUoa
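One possible reading of “quasi-randomization” is to draw scenarios from a low-discrepancy sequence instead of independent random numbers, so that a small sample still covers the range evenly. The sketch below uses the van der Corput sequence. Mapping it linearly onto a scenario range `[low, high]` (for instance an inflow level) is an assumption made for the example.

```python
def van_der_corput(n, base=2):
    """n-th element of the van der Corput low-discrepancy sequence in [0, 1)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, remainder = divmod(n, base)
        denom *= base
        q += remainder / denom
    return q

def representative_scenarios(count, low, high):
    """Hypothetical scenario sampler: spread `count` scenarios evenly over [low, high)."""
    return [low + (high - low) * van_der_corput(i) for i in range(1, count + 1)]

def average_cost(policy_cost, scenarios):
    """Average a per-scenario cost over the chosen scenarios."""
    return sum(policy_cost(x) for x in scenarios) / len(scenarios)
```

Reusing the same representative scenarios for every candidate parameter vector also removes the evaluation noise between candidates.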
21. Roadmap:
    - default policy for energy management problems: test, generalize, formalize, simplify...
    - this default policy ==> a parametric policy
    - test in DPS: strategy A
    - interface DPS with NewUoa and/or others (openDP opt?)
    - Strategy A: test into MCTS ==> Strategy B
    - ==> IMHO, strategy A = good tool for fast readable non-myopic results
    - ==> IMHO, strategy B = good for combining A with the efficiency of A for short term combinatorial effects.
    - Also, validating the partial observation (sounds good).