Reinforcement Learning to Mimic Portfolio Behavior
Yigal Jhirad
April 20, 2020
NVIDIA GTC 2020: Deep Learning and AI Conference
Silicon Valley
GTC 2020: Table of Contents
I. Reinforcement Learning
— Machine Learning Landscape
— Portfolio Mimicking Strategies
— Reinforcement Learning
— Deep Learning Networks
— Model Specification
— Monte Carlo Simulations/Generative Adversarial Network
— Mean/Variance vs. Reinforcement Learning/DQN
II. Summary
III. Author Biography
DISCLAIMER: This presentation is for information purposes only. The presenter accepts no liability for the content of
this presentation, or for the consequences of any actions taken on the basis of the information provided. Although the
information in this presentation is considered to be accurate, this is not a representation that it is complete or should be
relied upon as a sole resource, as the information contained herein is subject to change.
GTC 2020: Portfolio Mimicking Strategies
Building Tracking Baskets
— Identify alternatives to gain exposure to common factors of an index
— Hedge portfolios to reduce volatility
— Replicate an investment style
Portfolio Construction and Replication
— Identify a portfolio's DNA through its factor exposures (e.g., Value/Growth, Large Cap vs. Small
Cap, Momentum)
— Robust tracking portfolios should have two main characteristics (see the sketch below):
– Minimal tracking error
– Cointegration with the target (Engle/Granger) to minimize drift, the tendency of portfolios to
deviate over time due to structural bias
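A minimal sketch of how these two characteristics can be measured from daily return series; the inputs `port_ret` and `target_ret` are hypothetical arrays of daily returns, and the 252-day annualization is a common convention rather than anything specified in the slides.

```python
import numpy as np

def tracking_error(port_ret: np.ndarray, target_ret: np.ndarray) -> float:
    """Annualized tracking error: volatility of the daily active returns."""
    active = port_ret - target_ret
    return float(active.std(ddof=1) * np.sqrt(252))

def drift(port_ret: np.ndarray, target_ret: np.ndarray) -> float:
    """Drift: cumulative performance difference between tracker and target."""
    return float(np.prod(1 + port_ret) - np.prod(1 + target_ret))
```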
GTC 2020: Portfolio Mimicking Strategies
Traditional Mean/Variance optimization has limitations
— Stability
— Risk model bias
— Traditional tracking error variance minimization techniques do not explicitly embed
cointegration, i.e., there is no guarantee of a mean-reverting process or a stationary tracking
error
— As a result, the replicating portfolio will “drift” further away from the target and require
more frequent rebalancing
Cointegration
— Measures long-term relationships and dependencies: a long-term equilibrium across asset prices
— Error correction model
— Stationary tracking error
— Minimize the cumulative and maximum performance difference (drift) between the model portfolio
and the target (see the sketch below)
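A minimal sketch of these checks using statsmodels, assuming `port_prices` and `target_prices` are hypothetical aligned price series; the Engle-Granger test looks for cointegration, and an ADF test checks that the cumulative active return (the drift series) is stationary.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

def is_cointegrated(port_prices, target_prices, alpha=0.05):
    """Engle-Granger two-step test: cointegration implies the tracking
    relationship mean-reverts rather than drifting apart."""
    _, p_value, _ = coint(np.log(port_prices), np.log(target_prices))
    return p_value < alpha

def drift_is_stationary(cum_active_return, alpha=0.05):
    """ADF test on the drift series: stationarity means deviations
    from the target are transitory, not structural."""
    p_value = adfuller(cum_active_return)[1]
    return p_value < alpha
```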
GTC 2020: Portfolio Mimicking Strategies
[Chart] Drift: cumulative performance difference between the tracking portfolios and the target portfolio.
GTC 2020: Reinforcement Learning
So what is it about mimicking portfolios that makes them appropriate for RL?
— Environment/State: a partially observable Markov decision process
— Data generation process based on historical data and simulated environments, with shocks to
reflect risk-off regimes
— Agent: Portfolio Management Process
— Action: Portfolio Rebalance
— Rewards: a short-term reward (minimize tracking error) and a long-term reward (minimize drift)
fit within the overall Bellman equation (sketched below)
— A Deep Q-Network (DQN) implements deep Q-learning, replacing the state/action table with a
neural network and learning the value function through backpropagation
— Dynamic Programming/Feedback Loop
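One way to encode that two-horizon reward structure; the functional form and the trade-off weight `lam` are hypothetical choices, not something specified in the slides. The Bellman recursion Q(s,a) = r + γ max_a' Q(s',a') then propagates the long-term drift penalty back into today's rebalancing decision.

```python
def step_reward(port_ret, target_ret, cum_drift, lam=0.5):
    """Per-rebalance reward: penalize the short-term active return
    (tracking error) and the accumulated performance gap (drift)."""
    active = port_ret - target_ret
    new_drift = cum_drift + active
    reward = -abs(active) - lam * abs(new_drift)
    return reward, new_drift
```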
GTC 2020: Reinforcement Learning
Dynamic Policy Development
The environment and state will integrate
the macro, fundamental, technical and
portfolio exposures
— Environment/State/Agent/Action/Rewards
— The portfolio construction process (Agent) will interact with the market environment
(Environment), updating the state (State) and making portfolio decisions (Action) to mimic the
target portfolio over time (Rewards)
— The actions will lead to a new portfolio
which will “react” with the current
environment
— Monte Carlo Simulation may be used to
stress the environment and randomly
simulate and perturb states
— Effectively the agent is developing a
dynamic policy that leads to rewards
— Utilize a Double DQN
GTC 2020: Deep Learning Networks
Deep Q-Learning (DQN)
— Uses a neural network framework to maximize rewards
— Model-free, off-policy reinforcement learning method
— Uses the maximum Q-value across all actions available in that state
— Value-based temporal difference (TD) learning
Double DQN
— Uses one (online) DQN network for selecting the next action and a target DQN network for
evaluating its value, decoupling selection from evaluation
— Identify the target-network update frequency; use the target model to evaluate the next-state
value (see the sketch below)
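A minimal PyTorch sketch of that target computation, assuming `online_net` and `target_net` map a batch of states to per-action Q-values; all names and shapes here are illustrative.

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones,
                       gamma=0.99):
    """Double DQN: the online network selects the next action, the target
    network evaluates it, reducing the overestimation bias of plain DQN."""
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # select
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluate
    # dones is assumed to be a 0/1 float tensor marking terminal states
    return rewards + gamma * next_q * (1.0 - dones)
```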
Exploration vs. Exploitation
— Greedy and ε-greedy algorithms with a time-varying (decaying) ε; Dynamic Boltzmann Softmax
(both sketched below)
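Minimal sketches of both selection rules; the decay schedule and temperature values are hypothetical hyperparameters.

```python
import numpy as np

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.05, decay=1e-4):
    """Time-varying epsilon-greedy: explore widely early, exploit later."""
    eps = eps_end + (eps_start - eps_end) * np.exp(-decay * step)
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Boltzmann softmax: sample actions with probability proportional to
    exp(Q/T); annealing T over time gives the dynamic variant."""
    z = np.asarray(q_values) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.random.choice(len(q_values), p=p))
```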
GTC 2020: Model Specification & Formulation
Create a Target Portfolio based on specific rules and filters
— Utilize P/E, Momentum, Mean Reversion, Dividend Yield, Price-to-Book, and Size to build out the
portfolio
— The number of securities ranges between 25 and 40 stocks
— Rebalance on a weekly basis to maintain consistency of factor exposures
Construct a model portfolio based on variance minimization (the MV portfolio)
— A risk model based on a historical covariance matrix
Construct a model portfolio based on RL; a double DQN network will attempt to replicate the
exposures of this target portfolio
— The environment/state used for RL will integrate the valuation and technical factors across
securities
Both processes can only look through to the target portfolio once a month
— Mixed-integer optimization is used to limit the portfolio to no more than 20 names (see the
sketch below)
Review tracking risk and drift of these model portfolios vs. the target portfolio
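A minimal sketch of the MV leg with the cardinality cap, using cvxpy; `Sigma` (covariance matrix) and `b` (target weights) are hypothetical inputs, and the Boolean-indicator formulation is one common way to express the 20-name limit, not necessarily the presenter's exact setup.

```python
import cvxpy as cp

def mv_tracking_portfolio(Sigma, b, max_names=20):
    """Minimize ex-ante tracking error variance (w-b)' Sigma (w-b),
    long-only, fully invested, holding at most max_names securities."""
    n = len(b)
    w = cp.Variable(n)
    z = cp.Variable(n, boolean=True)  # 1 if the name is held
    objective = cp.Minimize(cp.quad_form(w - b, Sigma))
    constraints = [cp.sum(w) == 1, w >= 0, w <= z, cp.sum(z) <= max_names]
    cp.Problem(objective, constraints).solve()  # needs a MIQP-capable solver
    return w.value
```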
GTC 2020: Reinforcement Learning Framework
Reinforcement Learning Implementation
— Initialize network with random weights
— For each episode:
– For each time step
– The environment passes on the state: the market conditions, portfolio positions, etc.
– Select action across a universe of stocks
– Execute action and evaluate short term reward and projected long term reward utilizing a
double DQN implementation
– The environment will “react” to the output and generate a new state
– The new state will integrate the action into its profile
– Randomly shock the environment
Network hyperparameters (exposed in the skeleton below)
— Learning rate
— Discount factor for long-term rewards
— Incremental vs. mini-batch updates
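A skeleton of this loop, assuming a toy environment interface (`reset`, `step`) and an agent object with a replay buffer; every name here is illustrative rather than from the slides.

```python
def train(env, agent, episodes=500, gamma=0.99, batch_size=64,
          target_update=100):
    """Skeleton of the episode loop: act, observe the reward, store the
    transition, learn from a mini-batch, and periodically sync the target
    network used by the double DQN."""
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.act(state)                    # e.g. epsilon-greedy
            next_state, reward, done = env.step(action)  # env may be shocked
            agent.buffer.add(state, action, reward, next_state, done)
            if len(agent.buffer) >= batch_size:
                agent.learn(agent.buffer.sample(batch_size), gamma)
            if step % target_update == 0:
                agent.sync_target()                      # refresh target DQN
            state, step = next_state, step + 1
```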
GTC 2020: Monte Carlo Simulations
Monte Carlo Simulation
— Shock correlations/volatilities (a sketch follows this list)
— Equities: increase correlations and volatilities to simulate risk-off regimes
— Fixed Income
– Parallel Shifts in Term Structure
– Shock Key Rate Durations
— Commodities: shock demand/supply by perturbing the term structure (backwardation/contango)
— Currencies: Shock Volatilities/Correlation
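A minimal sketch of the correlation/volatility shock, assuming a covariance matrix `Sigma` of daily returns; the shock magnitudes are hypothetical.

```python
import numpy as np

def shock_covariance(Sigma, vol_mult=1.5, corr_shift=0.2):
    """Risk-off shock: scale volatilities up and push pairwise correlations
    toward 1, then rebuild the covariance matrix."""
    vols = np.sqrt(np.diag(Sigma))
    corr = Sigma / np.outer(vols, vols)
    corr = (1 - corr_shift) * corr + corr_shift * np.ones_like(corr)
    np.fill_diagonal(corr, 1.0)
    shocked_vols = vol_mult * vols
    return np.outer(shocked_vols, shocked_vols) * corr

def simulate_returns(mu, Sigma, n_days=250, seed=0):
    """Draw correlated daily returns under the (possibly shocked) regime."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, Sigma, size=n_days)
```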
Assimilate these scenarios alongside historical realized performance to complement the data
generation process
— Draw out factor exposures that may be latent variables and not very prominent in the prevailing
risk regime (e.g., low-volatility environments)
— Drift may be an outcome of a low-risk environment where many factors remain latent, and of risk
model bias that emphasizes select factors
GTC 2020: Generative Adversarial Network
Generative Adversarial Network
— Complement simulated data by using a generator and discriminator that play off against each
other to better simulate real-world data (see the sketch below)
— Better absorb time-varying and sequential data vs. a one-time shock
— Capture long-range relationships such as the presence of volatility clusters
— Simulate stress conditions and identify the impact of shocks on tracking and drift
— Training is a minimax game whose solution is a Nash equilibrium between generator and
discriminator
[Diagram] Input data (prices, volatility, correlations, fundamentals, technicals, macro, term
structure) feed the Generator and the Monte Carlo simulations; the Discriminator compares the
predicted/simulated data against extracted historical data.
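A minimal PyTorch sketch of one adversarial update, assuming `gen` maps latent noise to a simulated return path and `disc` outputs a single real/fake logit per sample; the architecture and names are illustrative.

```python
import torch
import torch.nn as nn

def gan_step(gen, disc, real_batch, opt_g, opt_d, latent_dim=32):
    """One adversarial step: the discriminator learns to separate historical
    paths from generated ones; the generator learns to fool it."""
    bce = nn.BCEWithLogitsLoss()
    n = real_batch.size(0)
    fake = gen(torch.randn(n, latent_dim))

    # Discriminator update: push real logits toward 1, fake logits toward 0.
    opt_d.zero_grad()
    d_loss = (bce(disc(real_batch), torch.ones(n, 1)) +
              bce(disc(fake.detach()), torch.zeros(n, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator update: make the fakes look real to the discriminator.
    opt_g.zero_grad()
    g_loss = bce(disc(fake), torch.ones(n, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```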
Summary of Results: Mean/Variance vs. DQN
Mean/Variance, by design, has consistently lower predicted (ex-ante) tracking
error. Will this translate into lower drift?
[Chart] Predicted (ex-ante) rolling tracking risk: Mean/Variance optimization vs. DQN.
Summary of Results: Mean/Variance vs. DQN
Mixed results: Reinforcement Learning was more effective at reducing drift from 2016-2018, while
Mean/Variance did better from 2013-2016
Summary of Results: Mean/Variance vs. DQN
[Chart] Drift: cumulative performance difference between the tracking portfolios and the target portfolio.
GTC 2020: Summary
Advantages
— Reinforcement Learning complements traditional optimization techniques to better mimic
portfolio behavior and create more robust portfolio replication solutions
— The reward structure fits nicely into an optimization framework targeting variance minimization
and cointegration
— Mean/Variance optimization may help RL as a first pass in creating the initial portfolio weights
Considerations
— Difficult to train
— Optimization: local minima and local convergence due to the presence of non-convexity
— Computationally intensive. Time constraints.
— Apply Genetic algorithms
— Leverage CUDA
More research needs to be done
Author Biography
Yigal D. Jhirad, Senior Vice President, is Director of Quantitative and Derivatives Strategies
and Portfolio Manager for Cohen & Steers. Mr. Jhirad heads the firm’s Investment Risk
Committee. Prior to joining the firm in 2007, Mr. Jhirad was an executive director in the
institutional equities division of Morgan Stanley, where he headed the company’s portfolio and
derivatives strategies effort. He was responsible for developing quantitative and derivatives
products for a broad array of institutional clients. Mr. Jhirad holds a BS from the Wharton
School. He is a Financial Risk Manager (FRM), as certified by the Global Association of Risk
Professionals.
LinkedIn: linkedin.com/in/yigaljhirad