Reinforcement Learning to Mimic Portfolio Behavior
Yigal Jhirad
April 20, 2020
NVIDIA GTC 2020: Deep Learning and AI Conference
Silicon Valley
GTC 2020: Table of Contents
I. Reinforcement Learning
— Machine Learning Landscape
— Portfolio Mimicking Strategies
— Reinforcement Learning
— Deep Learning Networks
— Model Specification
— Monte Carlo Simulations/Generative Adversarial Network
— Mean/Variance vs. Reinforcement Learning/DQN
II. Summary
III. Author Biography
DISCLAIMER: This presentation is for information purposes only. The presenter accepts no liability for the content of
this presentation, or for the consequences of any actions taken on the basis of the information provided. Although the
information in this presentation is considered to be accurate, this is not a representation that it is complete or should be
relied upon as a sole resource, as the information contained herein is subject to change.
GTC 2020: Portfolio Mimicking Strategies
Building Tracking Baskets
— Identify alternatives to gain exposure to common factors of an index
— Hedge portfolios to reduce volatility
— Replicate an investment style
Portfolio Construction and Replication
— Identify a portfolio's DNA through its factor exposures (e.g., Value/Growth, Large Cap vs. Small
Cap, Momentum)
— Robust tracking portfolios should have two main characteristics (see the sketch below):
– Minimal tracking error
– Cointegration with the target (Engle/Granger) to minimize drift, the tendency of portfolios to
deviate over time due to structural bias
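A minimal sketch of how these two characteristics can be measured from daily return series; the inputs `port_ret` and `target_ret` are hypothetical arrays of daily returns, and the 252-day annualization is a common convention rather than anything specified in the slides.

```python
import numpy as np

def tracking_error(port_ret: np.ndarray, target_ret: np.ndarray) -> float:
    """Annualized tracking error: volatility of the daily active returns."""
    active = port_ret - target_ret
    return float(active.std(ddof=1) * np.sqrt(252))

def drift(port_ret: np.ndarray, target_ret: np.ndarray) -> float:
    """Drift: cumulative performance difference between tracker and target."""
    return float(np.prod(1 + port_ret) - np.prod(1 + target_ret))
```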
GTC 2020: Portfolio Mimicking Strategies
Traditional Mean/Variance optimization has limitations
— Stability
— Risk model bias
— Traditional tracking error variance minimization techniques do not explicitly embed
cointegration, i.e., there is no guarantee of a mean-reverting process or a stationary tracking
error
— As a result, the replicating portfolio will “drift” further away from the target and require
more frequent rebalancing
Cointegration
— Measures long-term relationships and dependencies: a long-term equilibrium across asset prices
— Error correction model
— Stationary tracking error
— Minimize the cumulative and maximum performance difference (drift) between the model portfolio
and the target (see the sketch below)
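A minimal sketch of these checks using statsmodels, assuming `port_prices` and `target_prices` are hypothetical aligned price series; the Engle-Granger test looks for cointegration, and an ADF test checks that the cumulative active return (the drift series) is stationary.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

def is_cointegrated(port_prices, target_prices, alpha=0.05):
    """Engle-Granger two-step test: cointegration implies the tracking
    relationship mean-reverts rather than drifting apart."""
    _, p_value, _ = coint(np.log(port_prices), np.log(target_prices))
    return p_value < alpha

def drift_is_stationary(cum_active_return, alpha=0.05):
    """ADF test on the drift series: stationarity means deviations
    from the target are transitory, not structural."""
    p_value = adfuller(cum_active_return)[1]
    return p_value < alpha
```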
GTC 2020: Portfolio Mimicking Strategies
[Chart] Drift: cumulative performance difference between the tracking portfolios and the target portfolio.
GTC 2020: Reinforcement Learning
So what is it about mimicking portfolios that makes them appropriate for RL?
— Environment/State: a partially observable Markov decision process
— Data generation process based on historical data and simulated environments, with shocks to
reflect risk-off regimes
— Agent: Portfolio Management Process
— Action: Portfolio Rebalance
— Rewards: a short-term reward (minimize tracking error) and a long-term reward (minimize drift)
fit within the overall Bellman equation (sketched below)
— A Deep Q-Network (DQN) implements deep Q-learning, replacing the state/action table with a
neural network and learning the value function through backpropagation
— Dynamic Programming/Feedback Loop
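One way to encode that two-horizon reward structure; the functional form and the trade-off weight `lam` are hypothetical choices, not something specified in the slides. The Bellman recursion Q(s,a) = r + γ max_a' Q(s',a') then propagates the long-term drift penalty back into today's rebalancing decision.

```python
def step_reward(port_ret, target_ret, cum_drift, lam=0.5):
    """Per-rebalance reward: penalize the short-term active return
    (tracking error) and the accumulated performance gap (drift)."""
    active = port_ret - target_ret
    new_drift = cum_drift + active
    reward = -abs(active) - lam * abs(new_drift)
    return reward, new_drift
```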
GTC 2020: Reinforcement Learning
Dynamic Policy Development
The environment and state will integrate
the macro, fundamental, technical and
portfolio exposures
— Environment/State/Agent/Action/Rewards
— The portfolio construction process (Agent) will interact with the market environment
(Environment), updating the state (State) and making portfolio decisions (Action) to mimic the
target portfolio over time (Rewards)
— The actions will lead to a new portfolio
which will “react” with the current
environment
— Monte Carlo Simulation may be used to
stress the environment and randomly
simulate and perturb states
— Effectively the agent is developing a
dynamic policy that leads to rewards
— Utilize a Double DQN
GTC 2020: Deep Learning Networks
Deep Q-Learning (DQN)
— Uses a neural network framework to maximize rewards
— Model-free, off-policy reinforcement learning method
— Uses the maximum Q-value across all actions available in that state
— Value-based temporal difference (TD) learning
Double DQN
— Uses one (online) DQN network for selecting the next action and a target DQN network for
evaluating its value, decoupling selection from evaluation
— Identify the target-network update frequency; use the target model to evaluate the next-state
value (see the sketch below)
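A minimal PyTorch sketch of that target computation, assuming `online_net` and `target_net` map a batch of states to per-action Q-values; all names and shapes here are illustrative.

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones,
                       gamma=0.99):
    """Double DQN: the online network selects the next action, the target
    network evaluates it, reducing the overestimation bias of plain DQN."""
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # select
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluate
    # dones is assumed to be a 0/1 float tensor marking terminal states
    return rewards + gamma * next_q * (1.0 - dones)
```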
Exploration vs. Exploitation
— Greedy and ε-greedy algorithms with a time-varying (decaying) ε; Dynamic Boltzmann Softmax
(both sketched below)
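Minimal sketches of both selection rules; the decay schedule and temperature values are hypothetical hyperparameters.

```python
import numpy as np

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.05, decay=1e-4):
    """Time-varying epsilon-greedy: explore widely early, exploit later."""
    eps = eps_end + (eps_start - eps_end) * np.exp(-decay * step)
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Boltzmann softmax: sample actions with probability proportional to
    exp(Q/T); annealing T over time gives the dynamic variant."""
    z = np.asarray(q_values) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.random.choice(len(q_values), p=p))
```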
GTC 2020: Model Specification & Formulation
Create a Target Portfolio based on specific rules and filters
— Utilize P/E, Momentum, Mean Reversion, Dividend Yield, Price-to-Book, and Size to build out the
portfolio
— The number of securities ranges between 25 and 40 stocks
— Rebalance on a weekly basis to maintain consistency of factor exposures
Construct a model portfolio based on variance minimization (the MV portfolio)
— A risk model based on a historical covariance matrix
Construct a model portfolio based on RL; a double DQN network will attempt to replicate the
exposures of this target portfolio
— The environment/state used for RL will integrate the valuation and technical factors across
securities
Both processes can only look through to the target portfolio once a month
— Mixed-integer optimization is used to limit the portfolio to no more than 20 names (see the
sketch below)
Review tracking risk and drift of these model portfolios vs. the target portfolio
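A minimal sketch of the MV leg with the cardinality cap, using cvxpy; `Sigma` (covariance matrix) and `b` (target weights) are hypothetical inputs, and the Boolean-indicator formulation is one common way to express the 20-name limit, not necessarily the presenter's exact setup.

```python
import cvxpy as cp

def mv_tracking_portfolio(Sigma, b, max_names=20):
    """Minimize ex-ante tracking error variance (w-b)' Sigma (w-b),
    long-only, fully invested, holding at most max_names securities."""
    n = len(b)
    w = cp.Variable(n)
    z = cp.Variable(n, boolean=True)  # 1 if the name is held
    objective = cp.Minimize(cp.quad_form(w - b, Sigma))
    constraints = [cp.sum(w) == 1, w >= 0, w <= z, cp.sum(z) <= max_names]
    cp.Problem(objective, constraints).solve()  # needs a MIQP-capable solver
    return w.value
```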
GTC 2020: Reinforcement Learning Framework
Reinforcement Learning Implementation
— Initialize network with random weights
— For each episode:
– For each time step
– The environment passes on the state: the market conditions, portfolio positions, etc.
– Select action across a universe of stocks
– Execute action and evaluate short term reward and projected long term reward utilizing a
double DQN implementation
– The environment will “react” to the output and generate a new state
– The new state will integrate the action into its profile
– Randomly shock the environment
Network hyperparameters (exposed in the skeleton below)
— Learning rate
— Discount factor for long-term rewards
— Incremental vs. mini-batch updates
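A skeleton of this loop, assuming a toy environment interface (`reset`, `step`) and an agent object with a replay buffer; every name here is illustrative rather than from the slides.

```python
def train(env, agent, episodes=500, gamma=0.99, batch_size=64,
          target_update=100):
    """Skeleton of the episode loop: act, observe the reward, store the
    transition, learn from a mini-batch, and periodically sync the target
    network used by the double DQN."""
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.act(state)                    # e.g. epsilon-greedy
            next_state, reward, done = env.step(action)  # env may be shocked
            agent.buffer.add(state, action, reward, next_state, done)
            if len(agent.buffer) >= batch_size:
                agent.learn(agent.buffer.sample(batch_size), gamma)
            if step % target_update == 0:
                agent.sync_target()                      # refresh target DQN
            state, step = next_state, step + 1
```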
GTC 2020: Monte Carlo Simulations
Monte Carlo Simulation
— Shock correlations/volatilities (a sketch follows this list)
— Equities: increase correlations and volatilities to simulate risk-off regimes
— Fixed Income
– Parallel Shifts in Term Structure
– Shock Key Rate Durations
— Commodities: shock demand/supply by perturbing the term structure (backwardation/contango)
— Currencies: Shock Volatilities/Correlation
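A minimal sketch of the correlation/volatility shock, assuming a covariance matrix `Sigma` of daily returns; the shock magnitudes are hypothetical.

```python
import numpy as np

def shock_covariance(Sigma, vol_mult=1.5, corr_shift=0.2):
    """Risk-off shock: scale volatilities up and push pairwise correlations
    toward 1, then rebuild the covariance matrix."""
    vols = np.sqrt(np.diag(Sigma))
    corr = Sigma / np.outer(vols, vols)
    corr = (1 - corr_shift) * corr + corr_shift * np.ones_like(corr)
    np.fill_diagonal(corr, 1.0)
    shocked_vols = vol_mult * vols
    return np.outer(shocked_vols, shocked_vols) * corr

def simulate_returns(mu, Sigma, n_days=250, seed=0):
    """Draw correlated daily returns under the (possibly shocked) regime."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, Sigma, size=n_days)
```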
Assimilate these scenarios alongside historical realized performance to complement the data
generation process
— Draw out factor exposures that may be latent variables and not very prominent in the prevailing
risk regime (e.g., low-volatility environments)
— Drift may be an outcome of a low-risk environment where many factors remain latent, and of risk
model bias that emphasizes select factors
GTC 2020: Generative Adversarial Network
Generative Adversarial Network
— Complement simulated data by using a generator and discriminator that play off against each
other to better simulate real-world data (see the sketch below)
— Better absorb time-varying and sequential data vs. a one-time shock
— Capture long-range relationships such as the presence of volatility clusters
— Simulate stress conditions and identify the impact of shocks on tracking and drift
— Training is a minimax game whose solution is a Nash equilibrium between generator and
discriminator
[Diagram] Input data (prices, volatility, correlations, fundamentals, technicals, macro, term
structure) feed the Generator and the Monte Carlo simulations; the Discriminator compares the
predicted/simulated data against extracted historical data.
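A minimal PyTorch sketch of one adversarial update, assuming `gen` maps latent noise to a simulated return path and `disc` outputs a single real/fake logit per sample; the architecture and names are illustrative.

```python
import torch
import torch.nn as nn

def gan_step(gen, disc, real_batch, opt_g, opt_d, latent_dim=32):
    """One adversarial step: the discriminator learns to separate historical
    paths from generated ones; the generator learns to fool it."""
    bce = nn.BCEWithLogitsLoss()
    n = real_batch.size(0)
    fake = gen(torch.randn(n, latent_dim))

    # Discriminator update: push real logits toward 1, fake logits toward 0.
    opt_d.zero_grad()
    d_loss = (bce(disc(real_batch), torch.ones(n, 1)) +
              bce(disc(fake.detach()), torch.zeros(n, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator update: make the fakes look real to the discriminator.
    opt_g.zero_grad()
    g_loss = bce(disc(fake), torch.ones(n, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```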
Summary of Results: Mean/Variance vs. DQN
Mean/Variance, by design, has consistently lower predicted (ex-ante) tracking
error. Will this translate into lower drift?
[Chart] Predicted (ex-ante) rolling tracking risk: Mean/Variance optimization vs. DQN.
Summary of Results: Mean/Variance vs. DQN
Mixed results: Reinforcement Learning was more effective at reducing drift from 2016-2018, while
Mean/Variance did better from 2013-2016
Summary of Results: Mean/Variance vs. DQN
[Chart] Drift: cumulative performance difference between the tracking portfolios and the target portfolio.
GTC 2020: Summary
Advantages
— Reinforcement Learning complements traditional optimization techniques to better mimic
portfolio behavior and create more robust portfolio replication solutions
— The reward structure fits nicely into an optimization framework targeting variance minimization
and cointegration
— Mean/Variance optimization may help RL as a first pass in creating the initial portfolio weights
Considerations
— Difficult to train
— Optimization: local minima and local convergence due to the presence of non-convexity
— Computationally intensive. Time constraints.
— Apply Genetic algorithms
— Leverage CUDA
More research needs to be done
Author Biography
Yigal D. Jhirad, Senior Vice President, is Director of Quantitative and Derivatives Strategies
and Portfolio Manager for Cohen & Steers. Mr. Jhirad heads the firm’s Investment Risk
Committee. Prior to joining the firm in 2007, Mr. Jhirad was an executive director in the
institutional equities division of Morgan Stanley, where he headed the company’s portfolio and
derivatives strategies effort. He was responsible for developing quantitative and derivatives
products for a broad array of institutional clients. Mr. Jhirad holds a BS from the Wharton
School. He is a Financial Risk Manager (FRM), as certified by the Global Association of Risk
Professionals.
LinkedIn: linkedin.com/in/yigaljhirad