A comprehensive introduction to machine learning and deep learning, along with an application in finance (illustrated by an example of predicting bank failure). The differences between ML in tech and ML in finance are then outlined. The last section is excluded from the file.
This document provides an overview of machine learning concepts including:
- Defining machine learning as the study of how to build systems that improve with experience.
- Designing a learning system for the task of playing checkers by choosing the training experience, representation, and learning algorithm.
- Common machine learning applications like speech recognition, computer vision, and robot control.
Introduction to machine learning and model building using linear regression, by Girish Gore
A basic introduction to machine learning and a kick-start to the model-building process using linear regression. It covers the fundamentals of machine learning, focusing on the supervised learning method of linear regression. Importantly, it does so using the R language and shows how to interpret the linear regression results of a model. Interpretation of results, tuning, and accuracy metrics like RMSE (root mean squared error) are covered here.
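As a concrete illustration of the accuracy metric mentioned above, RMSE can be computed in a few lines. This is a minimal sketch in Python (the original presentation uses R); the sample values are illustrative:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

print(rmse([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # ≈ 0.645
```

A lower RMSE means the model's predictions sit closer, on average, to the observed values, in the same units as the target variable.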
This document contains the solutions to an exercise set for an introduction to machine learning course.
1) It provides examples for designing an ML system for tasks like face recognition and stock market prediction by specifying the task (T), performance measure (P), and type of experience or training data (E).
2) It outlines the full process for designing an ML system including specifying the target function, representation, learning mechanism, and type of learning experience.
3) Reasons for using ML include problems being complex, mining large datasets, potential for higher accuracy or speed, and replacing human tasks.
The document discusses machine learning and various related concepts. It provides an overview of machine learning, including well-posed learning problems, designing learning systems, supervised learning, and different machine learning approaches. It also discusses specific machine learning algorithms like naive Bayes classification and decision tree learning.
Machine learning allows computer programs to improve at tasks through experience. The document discusses defining a learning problem by specifying a task, performance measure, and training experiences. It also covers choosing a target function, representation, and learning algorithm like linear regression to approximate values for checkers positions based on weighted board features. Key issues discussed are how training data, complexity, and noise impact accuracy and learnability.
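The checkers design described above, a target function represented as a weighted sum of board features and trained from experience, can be sketched as follows. This is a hypothetical Python sketch: the four features, their weights, and the learning rate are illustrative choices, not taken from the document.

```python
# Linear target-function representation for checkers:
#   V(b) ≈ w0 + w1*x1 + ... + wn*xn
# trained with a least-mean-squares (LMS) style weight update.

def evaluate(weights, features):
    """Weighted sum of board features plus a bias term w0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, target, lr=0.001):
    """LMS update: nudge each weight toward the training value."""
    error = target - evaluate(weights, features)
    new_weights = [weights[0] + lr * error]  # bias update
    new_weights += [w + lr * error * x for w, x in zip(weights[1:], features)]
    return new_weights

# One training step on a hypothetical position with training value 1.0.
weights = [0.0, 0.5, -0.5, 0.2, -0.2]  # bias + weights for 4 board features
features = [12, 12, 0, 1]              # e.g. own pieces, opponent pieces, own kings, threatened pieces
weights = lms_update(weights, features, target=1.0)
```

Each update moves the evaluation of the training position closer to its training value; repeated over many positions, the weights come to approximate the target function.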
This document provides an introduction to machine learning, including definitions, key concepts, and algorithms. It defines machine learning as giving computers the ability to learn without being explicitly programmed. It distinguishes machine learning from artificial intelligence and describes supervised and unsupervised learning. Popular machine learning algorithms like naive Bayes, support vector machines, and decision trees are introduced. Python libraries for machine learning like scikit-learn are also mentioned.
This document provides an overview of machine learning basics including:
- A brief history of machine learning and definitions of machine learning and artificial intelligence.
- When machine learning is needed and its relationships to statistics, data mining, and other fields.
- The main types of learning problems - supervised, unsupervised, reinforcement learning.
- Common machine learning algorithms and examples of classification, regression, clustering, and dimensionality reduction.
- Popular programming languages for machine learning like Python and R.
- An introduction to simple linear regression and how it is implemented in scikit-learn.
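The simple linear regression in the last bullet can be sketched in plain Python so the closed-form least-squares fit is visible (in scikit-learn the equivalent one-liner is `LinearRegression().fit(X, y)`); the data points here are illustrative:

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0 (the points lie exactly on y = 2x + 1)
```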
1. Dr. R. Gunavathi of the PG and Research Department of Computer Applications at [institution name redacted] organized a seminar on IoT applications and machine learning.
2. The seminar featured a presentation by Assistant Professor Sushama of JECRC University on machine learning and its applications.
3. Machine learning involves using algorithms to improve performance on tasks based on experience. It is commonly used when human expertise is limited, models must be customized, or huge amounts of data are involved.
This document provides an introduction to machine learning techniques presented by Dr. Radhey Shyam. It begins with definitions of machine learning and discusses when machine learning is applicable. The document then covers types of learning problems, designing learning systems, the history of machine learning, function representation techniques, search algorithms, and evaluation parameters. It also introduces several machine learning approaches and discusses common issues in machine learning.
Deep Learning For Practitioners, lecture 2: Selecting the right applications..., by ananth
In this presentation we articulate, from a practitioner's viewpoint, when deep learning techniques yield the best results. Should we apply deep learning techniques to every machine learning problem? What characteristics make an application suitable for deep learning? Does more data automatically imply better results, regardless of the algorithm or model? Does "automated feature learning" obviate the need for data preprocessing and feature design?
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer..., by Madhav Mishra
The document discusses various topics related to evolutionary computation and artificial intelligence, including:
- Evolutionary computation concepts like genetic algorithms, genetic programming, evolutionary programming, and swarm intelligence approaches like ant colony optimization and particle swarm optimization.
- The use of intelligent agents in artificial intelligence and differences between single and multi-agent systems.
- Soft computing techniques involving fuzzy logic, machine learning, probabilistic reasoning and other approaches.
- Specific concepts discussed in more depth include genetic algorithms, genetic programming, swarm intelligence, ant colony optimization, and metaheuristics.
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject, such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Hidden Markov Models (HMM). It deals with several application areas such as Natural Language Processing and Image Understanding. This presentation provides the landscape.
Hot Topics in Machine Learning For Research and Thesis, by WriteMyThesis
Machine learning and its subfields have undergone tremendous growth in the past few years. It has a number of potential applications and is being used in many different fields, and a great deal of research is going on in this area. For more information, check out the PPT details.
The document provides an overview of machine learning. It defines machine learning as algorithms that can learn from data to optimize performance and make predictions. It discusses different types of machine learning including supervised learning (classification and regression), unsupervised learning (clustering), and reinforcement learning. Applications mentioned include speech recognition, autonomous robot control, data mining, playing games, fault detection, and clinical diagnosis. Statistical learning and probabilistic models are also introduced. Examples of machine learning problems and techniques like decision trees and naive Bayes classifiers are provided.
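The naive Bayes classifier mentioned above can be sketched in a few lines of pure Python with Laplace smoothing. The toy "spam" training data below is illustrative, not from the document:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, label). Returns class counts, word counts, vocab."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    """Pick the class maximizing log P(c) + sum log P(w|c), Laplace-smoothed."""
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total_docs)
        total_words = sum(word_counts[c].values())
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("buy cheap pills".split(), "spam"),
        ("cheap meds buy now".split(), "spam"),
        ("meeting at noon".split(), "ham"),
        ("lunch meeting tomorrow".split(), "ham")]
model = train_nb(docs)
print(classify_nb("cheap pills now".split(), *model))  # spam
```

The "naive" part is the conditional-independence assumption: each word contributes its log-probability independently, which makes both training and classification simple counting.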
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer..., by Madhav Mishra
The document discusses machine learning paradigms including supervised learning, unsupervised learning, clustering, artificial neural networks, and more. It then discusses how supervised machine learning works using labeled training data for tasks like classification and regression. Unsupervised learning is described as using unlabeled data to find patterns and group data. Semi-supervised learning uses some labeled and some unlabeled data. Reinforcement learning provides rewards or punishments to achieve goals. Inductive learning infers functions from examples to make predictions for new examples.
Machine learning and its applications was submitted by Bhuvan Chopra to Er. Seema Rani. The document provides an introduction to machine learning, the basic prerequisites for machine learning including algebra, linear algebra, statistics and Python programming. It describes the main types of machine learning including supervised learning, unsupervised learning and reinforcement learning. Finally, it discusses some common applications of machine learning such as virtual personal assistants, video surveillance, social media services, email spam filtering, online customer support, product recommendations, and online fraud detection.
The document provides an introduction to machine learning, including:
1) It defines machine learning and discusses how it differs from classical AI through inductive rather than deductive reasoning.
2) It outlines examples of learning tasks and systems involving tasks like playing chess or driving, with associated goals, experiences, and performance measures.
3) It discusses different ways to classify learning systems based on their goals, models, learning rules, and types of experiences like supervised vs unsupervised learning.
In the past few years, India has witnessed exponential growth in the data science sector. With the advent of digital transformation in businesses, demand for data scientists grows every day, with many job opportunities awaiting those who complete a machine learning course in Mumbai. Boston Institute of Analytics provides data science courses in Mumbai. They train students under experienced industry professionals and make them industry-ready. To learn more about their courses, check out their website: https://www.biaclassroom.com/courses.
This document summarizes machine learning and inductive logic programming techniques for multi-agent systems. It discusses using machine learning for single agents and multi-agent systems, including inductive learning, reinforcement learning, and unsupervised learning. For multi-agent systems, it covers social awareness, communication, and role learning using techniques like Q-learning.
This document provides an overview of machine learning. It defines machine learning as a form of artificial intelligence that allows systems to automatically learn and improve from experience without being explicitly programmed. The document then discusses why machine learning is important, how it works by exploring data and identifying patterns with minimal human intervention, and provides examples of machine learning applications like autonomous vehicles. It also summarizes the main types of machine learning: supervised learning, unsupervised learning, reinforcement learning, and deep learning. Finally, it distinguishes machine learning from deep learning and defines data science.
An Introduction to Reinforcement Learning - The Doors to AGI, by Anirban Santara
Reinforcement Learning (RL) is a genre of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach its specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get examples of "correct" actions in given states as ground truth. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way in which human beings learn to act in different situations. Hence it is often considered the gateway to achieving the goal of Artificial General Intelligence.
The motivation of this talk is to introduce the audience to key theoretical concepts like formulation of the RL problem using Markov Decision Process (MDP) and solution of MDP using dynamic programming and policy gradient based algorithms. State-of-the-art deep reinforcement learning algorithms will also be covered. A case study of the application of reinforcement learning in robotics will also be presented.
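The trial-and-error learning described above can be made concrete with tabular Q-learning, the classic value-based RL algorithm. This is a hypothetical Python sketch on a toy one-dimensional corridor; the environment, hyperparameters, and episode count are illustrative choices, not from the talk:

```python
import random

# Toy corridor MDP: states 0..4, actions 0 = left, 1 = right; reward 1.0 at state 4.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3

def step(state, action):
    """Environment dynamics: move left or right, clipped to the corridor."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]
random.seed(0)
for _ in range(300):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: explore with probability EPS.
        if random.random() < EPS:
            action = random.randrange(2)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # Tabular Q-learning update:
        #   Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt

# After training, "right" should have the higher value in every non-goal state.
print([max((0, 1), key=lambda a: Q[s][a]) for s in range(GOAL)])
```

Note how the agent never sees a "correct" action: the sparse, delayed reward at the goal propagates backwards through the Q-table, one bootstrapped update at a time.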
Machine learning involves using algorithms to optimize performance using example data or past experience. It is useful when human expertise does not exist, cannot be explained, or needs to adapt over time. The document discusses different types of machine learning including supervised learning techniques like classification and regression as well as unsupervised learning techniques like clustering. It provides examples of applications in various domains and lists resources for datasets, journals, and conferences in the machine learning field.
Machine learning works by processing data to discover patterns that can be used to analyze new data. Popular programming languages for machine learning include Python, R, and SQL. There are several types of machine learning including supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning. Common machine learning tasks involve classification, regression, clustering, dimensionality reduction, and model selection. Machine learning is widely used for applications such as spam filtering, recommendations, speech recognition, and machine translation.
The IPO Model of Evaluation (Input-Process-Output), by Janilo Sarmiento
The first IPO model is calculating the commission earned from two states' sales based on a commission rate. It takes in the sales amounts for each state and the commission rate as input. It then calculates the total sales by adding the two states' sales together, and calculates the commission by multiplying the total sales by the commission rate. It outputs the total commission.
The second IPO model is calculating the squared value of a number. It takes a single original number as input. It first checks if the number is less than or equal to zero, and if so displays an error message. If the number is greater than zero, it calculates the squared value by multiplying the original number by itself. It then outputs the squared value.
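The two IPO models above translate directly into code. This is a minimal Python sketch; the function and variable names are illustrative, and the validation step raises an error rather than printing a message so it can be checked programmatically:

```python
def commission(sales_state1, sales_state2, rate):
    """IPO model 1: commission earned from two states' sales at a given rate."""
    total_sales = sales_state1 + sales_state2   # process: add the two inputs
    return total_sales * rate                   # output: the total commission

def squared(number):
    """IPO model 2: square of a positive number, with input validation."""
    if number <= 0:                             # process: validate the input
        raise ValueError("number must be greater than zero")
    return number * number                      # output: the squared value

print(commission(1000.0, 2000.0, 0.05))  # 150.0
print(squared(9))                        # 81
```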
This document discusses machine learning concepts including what learning is, different types of learning tasks like classification and problem solving/planning, measuring performance, reasons to study machine learning, related disciplines, defining learning tasks, designing learning systems, sample learning problems, and lessons learned about learning. It uses the example of learning to play checkers to illustrate many of these concepts such as representing the target function, obtaining training data, choosing a learning algorithm, and discussing specific algorithms like least mean squares regression.
Building a performing Machine Learning model from A to Z, by Charles Vestur
A 1-hour read to become highly knowledgeable about Machine learning and the machinery underneath, from scratch!
A presentation introducing all the fundamental concepts of machine learning step by step, following a classical approach to building a performing model. Simple examples and illustrations are used throughout the presentation to make the concepts easier to grasp.
The document discusses different types of machine learning including supervised learning, unsupervised learning, and reinforcement learning. It provides examples of each type, such as using labeled data to classify emails as spam or not spam for supervised learning, grouping fruits by color without labels for unsupervised learning, and using rewards to guide an agent through a maze for reinforcement learning. The document also covers applications of machine learning across different domains like banking, biomedical, computer, and environment.
Machine learning is a subset of artificial intelligence that allows computers to learn without being explicitly programmed by improving their performance on tasks based on experience. It involves developing algorithms that can learn from and make predictions on data. There are many machine learning algorithms that differ in their representation, evaluation, and optimization methods, and algorithms can perform supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), semi-supervised learning, and reinforcement learning. Machine learning has applications in areas like web search, finance, e-commerce, robotics, and healthcare.
This document provides an overview of machine learning and artificial intelligence concepts. It discusses what machine learning is, including how machines can learn from examples to optimize performance without being explicitly programmed. Various machine learning algorithms and applications are covered, such as supervised learning techniques like classification and regression, as well as unsupervised learning and reinforcement learning. The goal of machine learning is to develop models that can make accurate predictions on new data based on patterns discovered from training data.
This document provides an introduction to an artificial intelligence course on machine learning. It discusses different machine learning tasks like classification, regression, transcription, and machine translation. It also covers the concepts of experience (datasets), performance evaluation, supervised vs unsupervised learning, and examples of tasks like face recognition, search queries prediction, and medical imaging analysis that are well-suited for machine learning. Key algorithms discussed include neural networks, decision trees, naive Bayes, and support vector machines.
This document discusses machine learning and artificial intelligence. It defines machine learning as a branch of AI that allows systems to learn from data and experience. Machine learning is important because some tasks are difficult to define with rules but can be learned from examples, and relationships in large datasets can be uncovered. The document then discusses areas where machine learning is influential like statistics, brain modeling, and more. It provides an example of designing a machine learning system to play checkers. Finally, it discusses machine learning algorithm types and provides details on the AdaBoost algorithm.
This document provides an introduction to machine learning, covering various topics. It defines machine learning as a branch of artificial intelligence that uses algorithms and data to enable machines to learn. It discusses different types of machine learning, including supervised, unsupervised, and reinforcement learning. It also covers important machine learning concepts like overfitting, evaluation metrics, and well-posed learning problems. The history of machine learning is reviewed, from early work in the 1950s to recent advances in deep learning.
1) Machine learning involves a computer program improving its performance on tasks through experience.
2) Examples of successful machine learning applications include speech recognition, autonomous vehicles, and playing backgammon.
3) Machine learning is important because some tasks are difficult to define with rules, relationships may be hidden in data, and environments change over time.
Machine Learning Chapter one introductionARVIND SARDAR
This document provides an introduction to machine learning, covering various topics. It defines machine learning as a branch of artificial intelligence that uses data and algorithms to enable computers to learn without being explicitly programmed. Various types of machine learning are discussed, including supervised, unsupervised, and reinforcement learning. Key concepts like hypothesis space, overfitting, evaluation metrics, and linear regression are introduced. Examples of well-posed learning problems are also provided.
1. The document discusses machine learning types including supervised learning, unsupervised learning, and reinforcement learning. It provides examples of applications like spam filtering, recommendations, and fraud detection.
2. Key challenges in machine learning are discussed such as poor quality data, lack of training data, and imperfections when data grows.
3. The difference between data science and machine learning is explained - data science is a broader field that includes extracting insights from data using tools and models, while machine learning focuses specifically on making predictions using algorithms.
Machine learning involves computers improving their ability to complete tasks through experience. A machine learning problem is well-defined if it identifies: 1) the class of tasks, 2) a performance measure to improve on, and 3) the source of training experience. For example, a program that learns to play checkers would improve its ability to win games (performance measure) by playing practice games against itself (training experience) for checkers games (class of tasks). How machines learn involves inputting past data, abstracting that data using algorithms, and generalizing the abstraction to make decisions.
Intro/Overview on Machine Learning PresentationAnkit Gupta
This document provides an overview of a presentation on machine learning given at Gurukul Kangri University in 2017. It defines machine learning as a field that allows computers to learn without being explicitly programmed. It discusses different machine learning algorithms including supervised learning, unsupervised learning, and semi-supervised learning. Examples of applications of machine learning discussed include data mining, natural language processing, image recognition, and expert systems. The document also contrasts artificial intelligence, machine learning, and deep learning.
Linear algebra provides the tools needed for machine learning algorithms by allowing complex operations to be described using matrices and vectors. It is widely used in machine learning because operations can be parallelized efficiently. Linear algebra also provides the foundation and notation used in other fields like calculus and probability that are important for machine learning. Machine learning involves feeding training data to algorithms that produce mathematical models to make predictions without being explicitly programmed. It works by learning from experience to improve performance at tasks over time. There are various applications of machine learning like image recognition, speech recognition, recommendations, and fraud detection.
The document provides an overview of machine learning algorithms and concepts, including:
- Supervised learning algorithms like regression and classification that use labeled training data to predict target values or categories. Unsupervised learning algorithms like clustering that find hidden patterns in unlabeled data.
- Popular Python libraries for machine learning like NumPy, SciPy, Matplotlib, and Scikit-learn that make implementing algorithms more convenient.
- Examples of supervised and unsupervised learning using a toy that teaches a child to sort shapes or find patterns without explicit labeling of data.
- Definitions of artificial intelligence, machine learning, and deep learning, and how they relate to each other.
Engineering Intelligent Systems using Machine Learning Saurabh Kaushik
This document discusses machine learning and how to engineer intelligent systems. It begins with an overview of machine learning compared to traditional programming. Next, it explains why machine learning is significant due to its ability to automate complex tasks and adapt/learn. It then discusses what machine learning is, the process of building machine learning models including data preparation, algorithm selection, training and evaluation. Finally, it provides examples of machine learning applications and demos predicting customer churn using classification algorithms and evaluating model performance.
This is the lecture delivered at Jadavpur University for the engineering students. The lecture was organised by the JU Entrepreneurship Cell and Alumni Association, Singapore Chapter.
This document provides information about an internship in artificial intelligence using Python. It includes definitions of common AI abbreviations and compares human organs to AI tools. It also discusses basics of AI, concepts in AI like machine learning and neural networks, qualities of humans and AI, important IDE software, useful Python packages, types of AI and machine learning, supervised and unsupervised machine learning algorithms, and the methodology for an image classification project including preprocessing data and extracting features from images.
This document provides information about an internship in artificial intelligence using Python. It includes abbreviations commonly used in AI and machine learning and compares human organs to AI tools. It also discusses basics of AI, concepts in AI like machine learning and neural networks, qualities of humans and AI, important software for AI like Anaconda and TensorFlow, and types of machine learning algorithms. The document provides an overview of the topics that will be covered in the internship.
This document provides an overview of the Foundations of Machine Learning (CS725) course for Autumn 2011. It introduces machine learning and discusses applications. It covers different machine learning models including supervised learning (classification and regression), unsupervised learning, semi-supervised learning, and active learning. It also discusses related fields, real-world applications, and tools/resources for the course.
AI is the study and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, and decision-making. Key applications of AI include advanced web search, recommendation systems, speech recognition in digital assistants, self-driving cars, and game playing. The goal of AI is to create systems that can think and act rationally. While progress has been made, fully simulating human intelligence remains a challenge.
In this slide I answer the basic questions about machine learning like:
What is Machine Learning?
What are the types of machine learning?
How to deal with data?
How to test model performance?
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMHODECEDSIET
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM all
1. Fundamentals of Machine Learning
2. ML in Tech vs ML in Finance
3. Example: Bank Rating Prediction
4. Deep Learning and Neural Networks
5. Example: Neural Net Copula in Markowitz Problem
1. Fundamentals of Machine Learning
Major AI Approaches
• Logic and Rules-Based Approach
• Hard-code knowledge about the world in formal languages
• Top-down rules are created for computers
• Computers reason about these rules automatically.
Example: Project Cyc (Lenat and Guha, 1989)
Example within law (Logic and Rules-Based Approach): Expert Systems
• TurboTax
• Personal income tax laws represented as logical computer rules
• Software computes tax liability
• Machine Learning (Pattern-Based Approach)
Learning: the process of converting experience into expertise or knowledge.
We wish to program “agents” that can “learn” from input data.
ML is what computers use to learn about the outside world, much like humans use math and physics for the same purpose.
Agent = Architecture + Algorithm
AI systems need the ability to acquire their own knowledge by extracting patterns from raw data.
Formal Definition
Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998): Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Example: chess. T: playing chess; E: the agent playing against itself; P: number of wins / number of games.
Artificial Intelligence
Studies “intelligent agents” that perceive their environment and perform different actions to solve tasks that involve mimicking cognitive functions of the human brain (Russell, Norvig).
Goals of AI:
• Knowledge representation: ontology, the set of objects, relations, concepts
• Taking actions, planning: acting with visualizing the future to achieve goals
• Perception and learning (ML): perception from sensors, learning from experience
• Natural language processing: the ability to read and understand human language
• Automated reasoning: mimicking human reasoning for logical deductions
Applied AI (present):
• Perception (learning), actions
• Communication (NLP)
• Knowledge/ontologies
• Reasoning, planning
Artificial General Intelligence (AGI, future):
• Learns and acts autonomously
• Uses sub-symbolic information
• Algorithmic theory of cognitive acts
• Solves any intellectual task
Agent and Environment
An agent perceives its environment and acts on it.
Perception Tasks: there is a fixed action. Perception comes via the physical world (through sensors) or digital data (read from a disk).
Action Tasks: there are multiple possible actions; these tasks involve planning and forecasting the future, and involve sub-tasks of learning for sequential (multi-step) problems. Actions can be fixed or can vary, and may or may not change the environment.
When do we need ML (instead of programming directly)?
• Complexity
1. Tasks performed by animals/humans: we can’t extract a well-defined program (driving, speech recognition, image understanding).
2. Tasks beyond human capabilities: analysis of very large and complex datasets (astronomical and genomic data, turning medical archives into medical knowledge, weather prediction).
• Adaptivity: programs that adapt to changes in the environment they interact with (handwritten text, spam detection, speech recognition).
Types of learning
• Supervised: an environment (teacher) “supervises” the learner by providing extra information (“labels”); we model p(y|x). We have train (seen) and test (unseen) data.
• Unsupervised: come up with a summary or compressed version of the data, learn a probability distribution, clustering (denoising, synthesis).
• Reinforcement: intermediate between the two. There is a teacher, but only partial feedback (reward) over a sequence of actions (e.g., the value of a chess position, self-driving).
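The supervised and unsupervised settings can be contrasted with a toy sketch (all data, labels, and numbers below are invented for illustration; reinforcement learning is omitted since it needs an interactive environment):

```python
# Toy contrast of supervised vs. unsupervised learning on 1-D data.

def nearest_neighbor_predict(train, x):
    """Supervised: predict the label of x from labeled pairs (x_i, y_i)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def two_means_cluster(points, iters=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (2-means)."""
    c0, c1 = min(points), max(points)          # initial centroids
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return a, b

# Supervised: labeled "spam scores" -> spam/ham labels
train = [(0.1, "ham"), (0.2, "ham"), (0.8, "spam"), (0.9, "spam")]
print(nearest_neighbor_predict(train, 0.85))   # spam

# Unsupervised: the same inputs, without labels, still split into two groups
low, high = two_means_cluster([0.1, 0.2, 0.8, 0.9])
print(sorted(low), sorted(high))               # [0.1, 0.2] [0.8, 0.9]
```

The supervised learner uses the labels directly; the unsupervised one recovers the same grouping from the inputs alone, without ever seeing a label.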
Linear Regression Example: satisfaction rate of company employees
Training data: company employees have rated their satisfaction on a scale of 1 to 100.
Predictor: a linear function hθ(x) = θ0 + θ1x.
Cost function: J(θ) = (1/2n) Σi (hθ(xi) − yi)².
As we minimize J (using gradient descent), the fitted line gets better and better, until we reach the best line.
Minimization algorithm: gradient descent, which repeatedly updates θ in the direction of the negative gradient g(θ) = ∂J/∂θ.
Plot of J: in this case J is convex, and therefore there are no local minima!
The contours of J and the iterations of gradient descent can be visualized; for more visualization:
https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220
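The whole pipeline above, predictor, cost function J, and gradient descent, fits in a few lines. A minimal sketch (the satisfaction numbers below are invented; for these points the least-squares slope is about 19.6):

```python
# Gradient-descent fit of y = theta0 + theta1 * x, mirroring the
# employee-satisfaction example with made-up data.

def fit_linear(xs, ys, lr=0.01, steps=5000):
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of J(theta) = (1/2n) * sum (h(x) - y)^2
        g0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / n
        g1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / n
        theta0 -= lr * g0          # step against the gradient
        theta1 -= lr * g1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # e.g. years at the company
ys = [22.0, 41.0, 62.0, 79.0, 101.0]    # satisfaction rating (1-100)
theta0, theta1 = fit_linear(xs, ys)
print(round(theta1, 1))   # 19.6 — the least-squares slope for these points
```

Because J is convex here, gradient descent with a small enough learning rate converges to the unique global minimum, matching the closed-form least-squares solution.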
Machine Learning Landscape: Perception Tasks
Supervised Learning:
• Regression: learn a regression function, given input/output pairs
• Classification: learn a class function, given input/output pairs
Unsupervised Learning:
• Clustering: given inputs only; k is the number of clusters
• Representation learning: learn a representer function, given inputs only
Machine Learning Landscape: Action Tasks
Reinforcement Learning:
• Optimization of a strategy for a task
• IRL: learn objectives from behavior
Machine Learning in Finance: Action Tasks
Reinforcement Learning:
• Optimization of a strategy for a task: trading strategies, asset management
• IRL: learn objectives from behavior: reverse engineering of consumer behavior, trading strategies, …
ML by Financial Application Areas: Perception Tasks
Banking (retail, P2P lending, commercial and investment):
• Customer segmentation
• Loan defaults, credit card defaults
• Fraud detection, anti-money laundering
• Rating prediction, default modeling
• Client data mining, recommender systems
Asset Management:
• Portfolio optimization, multi-period portfolio optimization
• Representation learning: factor modeling, de-noising
• Regime change detection, stock segmentation
• Derivatives trading
ML by Financial Application Areas: Action Tasks
Quantitative Trading:
• Profit-maximizing trading execution, optimal trade execution
• Quantitative trading strategies, earnings prediction
• Algorithmic trading, optimal market making
2. ML in Tech vs ML in Finance
ML in Tech:
• Perception (image recognition, NLP tasks, etc.). Methods: SL/UL
• Action (computational advertising, robotics, self-driving cars, etc.). Methods: SL/UL/RL
ML in Finance: forecasting tasks and valuation tasks.
ML in Finance
Perception: Forecasting tasks
• Security price prediction (stocks, bonds, commodities, etc.). Methods: SL/UL
• Prediction of corporate actors’ actions (dividends, mergers, defaults, etc.). Methods: SL/UL/RL
• Prediction of individual actors’ actions (loan defaults, fraud, AML, etc.). Methods: SL/UL/RL
ML in Finance
Perception: Valuation tasks
• Asset valuation (stocks, futures, commodities, bonds, etc.); related to forecasting. Methods: SL/UL
• Derivatives valuation. Methods: SL/UL/RL
Big Data? Typically yes for ML in Tech, typically no for ML in Finance.
Data for ML in Tech are of huge size; most data for ML in Finance are medium-sized, except in HFT.
Stationary data? Typically yes for ML in Tech, typically no for ML in Finance.
As most financial data are non-stationary, collecting more data, even when possible, is not always helpful.
Noise-to-signal ratio? Typically low for ML in Tech, typically high for ML in Finance.
Financial data are typically quite noisy, and the “true” signals are unobservable!
Interpretability of results? Typically not important (or not the main focus) for ML in Tech; typically either desired or required for ML in Finance.
Interpretability of results is:
• Desired for trading
• Required for regulation (General Data Protection Regulation, 2018)
Action (RL) tasks? Low-dimensional state-action space and low uncertainty in ML in Tech; high-dimensional state-action space and high uncertainty in ML in Finance.
• ML in Tech: the dimensionality of the state-action space is usually in the hundreds. The action space is often discrete (except in robotics). Uncertainty is low to moderate (think self-driving cars!).
• ML in Finance: the dimensionality of the state-action space is often in the thousands. The action space is usually continuous. Uncertainty is low to high (think Brexit!).
1. Fundamentals of Machine Learning
A Gentle Model (Statistical Learning Framework)
• Domain set X: the features
• Label set Y (discrete or continuous)
• Training data S = ((x1, y1), …, (xm, ym)): also called the training set (seen); this is the learner’s input
• Prediction function (hypothesis) h: X → Y
• Data-generation model: a probability distribution D over X from which instances are drawn, with labels given by a target function f
• Measure of success: the error of the predictor, defined by a loss function
• The learner’s output: a predictor h_S chosen on the basis of S
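These objects can be made concrete in a few lines. A sketch with invented numbers: the domain is [0, 1), the labels are {0, 1}, f is an assumed target function, and h is a deliberately imperfect hypothesis:

```python
# The statistical learning framework on toy data.

def f(x):
    """The (normally unknown) target labeling function, assumed here."""
    return 1 if x >= 0.5 else 0

def h(x):
    """A candidate predictor h: X -> Y, deliberately slightly off."""
    return 1 if x >= 0.6 else 0

def empirical_error(hyp, sample):
    """L_S(h): the fraction of sample points the predictor gets wrong."""
    return sum(1 for x, y in sample if hyp(x) != y) / len(sample)

# Training set S: pairs (x_i, y_i) with y_i = f(x_i), x_i drawn from X
S = [(x / 10, f(x / 10)) for x in range(10)]
print(empirical_error(h, S))   # 0.1 — h disagrees with f only at x = 0.5
```

Here the loss is the 0-1 loss, matching the misclassification error used in the slides.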
Types of Error
• The ability to perform well on previously unobserved inputs is called generalization.
• What separates machine learning from optimization is that we want the generalization error to be low as well.
• We estimate the generalization error by a test set of examples that were collected separately from the training set.
Training error L_S(h): the error measured on the training set.
Generalization error (test error): L_{D,f}(h) ≝ P_{x∼D}[h(x) ≠ f(x)]
51. 49
1. Fundamentals of Machine Learning
• We sample the training set, then use it to choose the parameters to
reduce training set error. Under this process, the expected test error is
greater than or equal to the expected value of training error
• The factors determining how well a machine learning algorithm will
perform are its ability to
1. Make the training error small (underfitting)
2. Make the gap between training and test error small (overfitting)
Types of Error
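A minimal numeric sketch of the two factors, assuming synthetic data (the sine curve, noise level, and polynomial degrees are illustrative choices, not from the slides):

```python
import numpy as np

# Hedged sketch: compare a low-capacity and a high-capacity model on the same
# noisy samples of a sine curve.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 3, 20)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)

def train_error(degree):
    """Least-squares polynomial fit; return the mean squared training error."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_train)
    return np.mean((pred - y_train) ** 2)

# A line (capacity 1) underfits the sine; a degree-9 polynomial drives the
# training error far lower. The train/test *gap*, not shown here, is what
# typically grows with capacity (overfitting).
print(train_error(1), train_error(9))
```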
52. 50
1. Fundamentals of Machine Learning
Papayas Example
• No matter what the sample S is, the memorizing predictor h_S predicts label 1 on only a finite number of instances (the training points), so:
L_S(h_S) = 0, yet L_D(h_S) = 1/2
• We have found a predictor whose performance on the training set is excellent, yet its
performance on the true “world” is very poor
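The papayas argument can be checked numerically. A hedged sketch, assuming the gray region is the rectangle [0,2]×[0,1] (area 2) and the blue region is the unit square (area 1), matching the slide's figure:

```python
import numpy as np

# Instances are uniform on the gray rectangle [0,2]x[0,1]; f labels a point 1
# iff it lies in the blue unit square [0,1]x[0,1], so P[f(x) = 1] = 1/2.
rng = np.random.default_rng(2)
train = rng.uniform([0, 0], [2, 1], size=(30, 2))

def f(points):
    return (points[:, 0] <= 1).astype(int)  # inside the unit square

train_labels = f(train)

def memorizer(points):
    """h_S: predict the stored label on training points, 0 everywhere else."""
    out = np.zeros(len(points), dtype=int)
    for i, p in enumerate(points):
        match = np.all(np.isclose(train, p), axis=1)
        if match.any():
            out[i] = train_labels[match.argmax()]
    return out

# Training error L_S(h_S) is exactly 0 ...
train_err = np.mean(memorizer(train) != train_labels)
# ... but on fresh samples h_S outputs 0 almost surely, so the true error
# L_D(h_S) approaches P[f(x) = 1] = 1/2.
fresh = rng.uniform([0, 0], [2, 1], size=(10000, 2))
true_err = np.mean(memorizer(fresh) != f(fresh))
print(train_err, true_err)
```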
54. 52
1. Fundamentals of Machine Learning
• Overfitting occurs when our hypothesis fits the training data “too well” (perhaps
like the everyday experience that a person who provides a perfect detailed
explanation for each of his single actions may raise suspicion).
Altering Capacity
• A model’s capacity is its ability to fit a wide variety of functions.
• Capacity is controlled by restricting the hypothesis class (its size or complexity, e.g. its VC dimension), by regularization techniques, by the number of program bits, …
• Restricting H to axis-aligned rectangles guarantees not to overfit
• If H is a finite class, then ERM_H will not overfit
55. 53
1. Fundamentals of Machine Learning
Bias – Complexity Tradeoff
Error Decomposition
Approximation Error
• Due to underfitting
• the minimum risk achievable by a predictor in the hypothesis class.
• how much risk we have because we restrict ourselves to a specific class (bias)
• depends on the chosen hypothesis class
• Reflects the quality of prior knowledge
Estimation Error
• Due to overfitting
• the difference between the error of the learned predictor and the approximation error
• It exists because the training error is only an estimate of the generalization error
• depends on the training set size and on the size or complexity of the hypothesis class
57. 55
1. Fundamentals of Machine Learning
Bias – Complexity Tradeoff
[Figure: error vs. model capacity, for varying data complexity]
58. 56
1. Fundamentals of Machine Learning
Generalization
Design Matrix
• A model is trained using only a training set
• A test set is used to estimate algorithm’s ability to generalize, i.e. perform well on
unseen data.
59. 57
1. Fundamentals of Machine Learning
• To generalize well, machine learning algorithms need to be guided by prior beliefs
about what kind of function they should learn.
• the stronger the prior knowledge (or prior assumptions) that one starts the learning
process with, the easier it is to learn from further examples. However, the stronger
these prior assumptions are, the less flexible the learning is (it is bound, a priori, by the
commitment to these assumptions.)
Prior Knowledge: Examples
• Restricting our hypothesis class (finiteness, VC dimension)
• Assumptions on the distribution
60. 58
1. Fundamentals of Machine Learning
Prior Knowledge
Bait Shyness
The rats seem to have some “built in” prior knowledge telling them that, while temporal
correlation between food and nausea can be causal, it is unlikely that there would be a
causal relationship between food consumption and electrical shocks or between sounds
and nausea.
63. 61
3. Bank Failures Example
FDIC
• US-based commercial banks are regulated by the FDIC
• FDIC provides insurance for commercial banks, and charges them an insurance premium
according to an internal (and non-public) rating based on the CAMELS supervisory
system
65. 63
3. Bank Failures Example
CAMELS
• Rate 1: best, Rate 5: worst
• A bank rated 4 or 5 is likely to be closed soon
Capital inadequacy is the most common cause of a
bank closure (other reasons: violation of financial
rules, management failures)
If FDIC decides to close the bank, it takes over both
its assets and its liabilities and then tries to sell the
assets at the best price possible to pay up the
liabilities.
• CAMEL ratings are not publicly known; However,
Call Reports are available.
• In addition, FDIC provides historical data for failed
banks:
(https://www.fdic.gov/bank/individual/failed/)
66. 64
3. Bank Failures Example
Call Report
• 28 schedules in total
• Form FFIEC 031: for banks with both domestic (US) and foreign offices
• Form FFIEC 041: for banks with domestic (US) offices only
69. 67
3. Bank Failures Example
Correlation Matrix of features
In this problem we want to distinguish failed (defaulted) banks from non-failed banks
NI: net income
log_TA: logarithm of total assets
TL: total loans
NPL: non-performing loans
Assessment Base: average consolidated assets minus tangible equity
…
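A hedged sketch of building such a correlation matrix with pandas. The feature names follow the slide, but the numbers are synthetic stand-ins; real values would come from FDIC Call Report data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the slide's features (all numbers are assumptions).
rng = np.random.default_rng(3)
n = 200
log_TA = rng.normal(12, 1, n)                      # log of total assets
TL = np.exp(log_TA) * rng.uniform(0.5, 0.8, n)     # total loans
NPL = TL * rng.uniform(0.0, 0.1, n)                # non-performing loans
NI = np.exp(log_TA) * rng.normal(0.01, 0.02, n)    # net income

df = pd.DataFrame({"NI": NI, "log_TA": log_TA, "TL": TL, "NPL": NPL})
corr = df.corr()  # pairwise Pearson correlations between the features
print(corr.round(2))
```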
78. 75
4. Deep Learning and Neural Networks
The performance of simple machine learning algorithms depends heavily on the
representation of the data they are given.
Goal: separate the factors of variation
Problem: each factor of variation influences every single piece of data we are able to
observe (e.g. the pixels of a red car imaged at night are close to black, and the car’s
silhouette depends on the viewing angle)
Most applications require us to disentangle the factors of variation and discard
the ones that we do not care about
Representation Learning: use ML to discover not only the mapping from
representation to output but also the representation itself.
quintessential example: Autoencoder
the combination of an encoder function, which converts the input data into a
different representation, and a decoder function, which converts the new
representation back into the original format.
Representation
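The encoder/decoder idea can be sketched with a linear autoencoder trained by plain gradient descent; the data, dimensions, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

# A linear autoencoder with a 1-D code, trained on synthetic 2-D data lying
# near a line (so one code dimension suffices to reconstruct it).
rng = np.random.default_rng(4)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.05 * rng.normal(size=(200, 2))

W_enc = rng.normal(scale=0.1, size=(2, 1))  # encoder: input -> 1-D code
W_dec = rng.normal(scale=0.1, size=(1, 2))  # decoder: code -> reconstruction

def loss():
    """Mean squared reconstruction error of decode(encode(X))."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial = loss()
lr = 0.02
for _ in range(2000):
    code = X @ W_enc                      # encode
    recon = code @ W_dec                  # decode
    grad = 2 * (recon - X) / X.size       # d(loss)/d(recon)
    g_dec = code.T @ grad                 # d(loss)/d(W_dec)
    g_enc = X.T @ (grad @ W_dec.T)        # d(loss)/d(W_enc)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
print(initial, loss())  # reconstruction error drops
```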
80. 77
4. Deep Learning and Neural Networks
Deep learning solves this problem by introducing representations that are
expressed in terms of other, simpler representations.
(build complex concepts out of simpler concepts. )
Example
81. 77
4. Deep Learning and Neural Networks
Depth
Depth enables the computer to learn a multistep computer program
Layer: state of the computer’s memory after executing another set of instructions in
parallel
Networks with greater depth can execute more instructions in sequence (later
instructions can refer back to the results of earlier instructions).
Measuring Depth
1. Depth of computational graph: number of sequential instructions (length of the
longest path through a flow chart)
2. Depth of the concept graph: the depth of the graph describing how concepts are related to each other.
• Depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves
85. 81
4. Deep Learning and Neural Networks
History of DL
• Dates back to 1940s (only appears to be new)
• Different Names:
1. 1940s – 1960s: cybernetics
2. 1980s – 1990s: connectionism
3. Since 2006: deep learning
4. As learning algorithms for biological learning (models of how learning happens or
could happen in the brain): artificial neural networks
Neural Perspective on DL
1. Brain provides a proof that intelligent behavior is possible
2. Reverse engineer the computational principles behind the brain
• Today, neuroscience is regarded as an important source of inspiration for DL
researchers, but it is no longer the predominant guide for the field: to obtain a
deep understanding of the actual algorithms used by the brain, we would need to be
able to monitor the activity of (at the very least) thousands of interconnected neurons
simultaneously.
• The basic idea of having many computational units that become intelligent only via their
interactions with each other is inspired by the brain
• 1980s algorithms work quite well, but this was not apparent circa 2006 because they
were too computationally costly.
86. 82
4. Deep Learning and Neural Networks
• Increasing Dataset sizes: Some skill is required to get good performance from a DL
algorithm. Fortunately, the amount of skill required reduces as the amount of training
data increases.
The age of “Big Data” has made ML much easier because the key burden of statistical
estimation (generalizing to new data after observing only a small amount) has been
considerably lightened.
• Increasing Model Sizes: animals become intelligent when many of their neurons work
together. Larger networks are able to achieve higher accuracy on more complex tasks.
History of DL
87. 83
4. Deep Learning and Neural Networks
Challenges motivating DL
• Curse of Dimensionality
[Figure: the number of distinct regions grows exponentially with the number of dimensions]
A statistical challenge arises because the number of possible configurations of x is much
larger than the number of training examples.
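A tiny illustration of the exponential blow-up (the choice of 10 bins per variable is an assumption for illustration):

```python
from itertools import product

# If each of d variables is discretized into 10 bins, the number of distinct
# cells a training set would need to cover is 10**d: exponential in d.
def n_cells(d, bins=10):
    return sum(1 for _ in product(range(bins), repeat=d))

print(n_cells(1), n_cells(2), n_cells(3))  # 10 100 1000
```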
88. 84
4. Deep Learning and Neural Networks
www.playground.tensorflow.org
• Local Constancy and Smoothness
Among the most widely used of these implicit “priors” is the smoothness
prior, or local constancy prior.
It states that the function we learn should not change very much within a small region.
Much of the modern motivation for deep learning is derived from studying the limitations of
local template matching and how deep models are able to succeed in cases where local
template matching fails (Bengio et al., 2006b).
89. 85
4. Deep Learning and Neural Networks
Neural Networks
Feedforward Neural Network (MLP)
Goal: approximate some function f* with some f(x; θ)
Feedforward: information flows through the function with no feedback connections
Neural: loosely inspired by neuroscience
Network: composing together many different functions, f(x) = f^(3)(f^(2)(f^(1)(x)))
(f^(i) is the i’th layer; the final layer is the output layer)
Depth: overall length of the chain
Width: dimensionality of the hidden layers
Hidden layer: the training data does not show the desired output for each of these layers
• During NN training, we drive f(x) to match f*(x)
• Each hidden layer is vector-valued
90. 86
4. Deep Learning and Neural Networks
Feedforward Neural Network (MLP)
[Figure: a chain of layers f^(1), f^(2), f^(3); depth is the length of the chain, width the dimensionality of each layer]
91. 87
Feedforward Neural Network (MLP)
MLP as a kernel technique
Extend linear models to represent nonlinear functions of x by applying the linear model not
to x itself, but to a transformed input φ(x)
How to choose φ:
1. Generic φ: infinite-dimensional (based on the RBF kernel).
Enough capacity but poor generalization
2. Manually engineer φ: requires decades of human effort for each separate task
3. Learn φ: f(x; θ, w) = φ(x; θ)ᵀw
This is an example of a deep feedforward network, with φ defining a hidden layer
• The advantage of the 3rd approach is that the human designer only needs to find the right
general function family rather than finding precisely the right function.
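Option 2 (manually engineering φ) can be illustrated on XOR, which is not linearly separable in x but becomes exactly linearly representable under a hand-designed feature map; this particular φ and weight vector are illustrative choices, not from the slides:

```python
import numpy as np

# XOR(x1, x2) = x1 + x2 - 2*x1*x2, so with phi(x) = (x1, x2, x1*x2) the
# linear model w . phi(x) with w = (1, 1, -2) fits XOR exactly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

w = np.array([1, 1, -2])
pred = np.array([w @ phi(x) for x in X])
print(pred)  # [0 1 1 0]
```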
92. 88
4. Deep Learning and Neural Networks
Feedforward Neural Network (MLP)
Example: Learning XOR
• After solving: f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b, where
W = [[1, 1], [1, 1]], c = (0, −1)ᵀ, w = (1, −2)ᵀ, and b = 0
• Most neural networks establish a nonlinear function by using an affine transformation
controlled by learned parameters, followed by a fixed nonlinear function called an
activation function:
h = g(Wᵀx + c), where g is typically the ReLU, g(z) = max{0, z}
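The solution above can be checked numerically; this sketch just evaluates the network on all four XOR inputs:

```python
import numpy as np

# f(x) = w^T max{0, W^T x + c} + b with the solved parameters computes XOR.
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

def f(x):
    return w @ np.maximum(0, W.T @ x + c) + b  # ReLU hidden layer, linear output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print([int(f(x)) for x in X])  # [0, 1, 1, 0]
```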
93. 89
4. Deep Learning and Neural Networks
When x1 = 0, the model’s output must increase as x2 increases. When
x1 = 1, the model’s output must decrease as x2 increases.
95. 91
4. Deep Learning and Neural Networks
Recurrent Neural Network (RNN)
• For processing a sequence of values x^(1), …, x^(τ) (the length τ can be variable)
• Parameter sharing: using the same parameter for more than one function in a
model (tied weights).
If we had separate parameters for each value of the time index, we could
not generalize to sequence lengths not seen during training, nor share
statistical strength across different sequence lengths and across different
positions in time. Such sharing is particularly important when a specific piece
of information can occur at multiple positions within the sequence. (“I went
to Nepal in 2009” and “In 2009, I went to Nepal”)
• Each member of the output is a function of the previous members of the output. Each
member of the output is produced using the same update rule applied to the previous
outputs.
• Include cycles that represent the influence of the present value of a variable on its own
value at a future time step.
• Any function involving recurrence can be considered a recurrent neural network.
96. 92
4. Deep Learning and Neural Networks
Parameter Sharing
Recurrent Neural Network (RNN)
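Parameter sharing can be sketched in a few lines: one set of weights is reused at every time step, so the same network handles sequences of different lengths (the sizes and random weights below are illustrative assumptions):

```python
import numpy as np

# A single set of weights (U, W, V) is applied at every time step.
rng = np.random.default_rng(5)
U = rng.normal(scale=0.1, size=(4, 3))  # input  -> hidden
W = rng.normal(scale=0.1, size=(4, 4))  # hidden -> hidden (recurrent, shared)
V = rng.normal(scale=0.1, size=(2, 4))  # hidden -> output

def rnn_forward(xs):
    """Run the RNN over a sequence of 3-dim inputs; one 2-dim output per step."""
    h = np.zeros(4)
    outputs = []
    for x in xs:                 # the same U, W, V are reused at every step
        h = np.tanh(U @ x + W @ h)
        outputs.append(V @ h)
    return np.array(outputs)

short = rnn_forward(rng.normal(size=(3, 3)))
longer = rnn_forward(rng.normal(size=(7, 3)))
print(short.shape, longer.shape)  # (3, 2) (7, 2)
```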
97. 93
4. Deep Learning and Neural Networks
Unfolding Computational Graphs
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same
input size, because it is specified in terms of transition from one state to
another state, rather than specified in terms of a variable-length history of
states.
2. It is possible to use the same transition function f with the same parameters
at every time step.
Recurrent Neural Network (RNN)
98. 94
4. Deep Learning and Neural Networks
Some types of RNNs
Recurrent Neural Network (RNN)
I. Produce an output at each time step and have recurrent connections between hidden
units
II. Produce an output at each time step and have recurrent connections only from the
output at one time step to the hidden units at the next time step.
III. With recurrent connections between hidden units, that read an entire sequence and
then produce a single output
• The network with recurrent connections only from the output at one time step to
the hidden units at the next time step is strictly less powerful because it lacks hidden-to-
hidden recurrent connections. For example, it cannot simulate a universal Turing
machine. It requires that the output units capture all the information about the past that
the network will use to predict the future.
103. 99
4. Deep Learning and Neural Networks
Teacher Forcing
Recurrent Neural Network (RNN)
a procedure that emerges from the maximum likelihood criterion, in which during training
the model receives the ground-truth output y^(t) as input at time t + 1.
log p(y^(1), y^(2) | x^(1), x^(2)) = log p(y^(2) | y^(1), x^(1), x^(2)) + log p(y^(1) | x^(1), x^(2))
• Teacher forcing makes it possible to avoid back-propagation through time in models that lack hidden-to-hidden connections.
Teacher forcing may still be applied to models that have hidden-to-hidden connections
as long as they have connections from the output at one time step to values computed
in the next time step.
• As soon as the hidden units become a function of earlier time steps, however, the BPTT
algorithm is necessary.
• Some models may thus be trained with both teacher forcing and BPTT.
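A minimal sketch of how teacher-forced decoder inputs are built: at step t the model is fed the ground-truth output y[t-1] rather than its own prediction (the sequence and start token below are illustrative):

```python
import numpy as np

# Build teacher-forced inputs by shifting the target sequence right one step.
y = np.array([3, 1, 4, 1, 5])                             # ground-truth outputs
start_token = 0                                           # illustrative start symbol
teacher_inputs = np.concatenate([[start_token], y[:-1]])  # input at t is y[t-1]
print(teacher_inputs)  # [0 3 1 4 1]
```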
105. 101
4. Deep Learning and Neural Networks
Any time we choose a specific machine learning algorithm, we are implicitly stating some
set of prior beliefs we have about what kind of function the algorithm should learn.
Choosing a deep model encodes a very general belief that the function we want to learn
should involve composition of several simpler functions. This can be interpreted from a
representation learning point of view as saying that we believe the learning problem
consists of discovering a set of underlying factors of variation that can in turn be described
in terms of other, simpler underlying factors of variation. Alternately, we can interpret the
use of a deep architecture as expressing a belief that the function we want to learn is a
computer program consisting of multiple steps, where each step makes use of the previous
step’s output. These intermediate outputs are not necessarily factors of variation but can
instead be analogous to counters or pointers that the network uses to organize its internal
processing. Empirically, greater depth does seem to result in better generalization.
Last Note
106. 102
References
1. Understanding Machine Learning: From Theory to
Algorithms (Shai Ben-David and Shai Shalev-
Shwartz)
2. Deep Learning (Aaron C. Courville, Ian Goodfellow,
and Yoshua Bengio)
3. “Machine Learning in Finance” course
(www.coursera.org)
4. Advances in Financial Machine Learning (Marcos
López de Prado)
Other examples: anomaly detection (fraud), suggestions on social media, Google News, learning someone’s taste
Another fancy example: speech synch
Knowledge Representation: representing information about the world in a form that a computer system can utilize to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. This field incorporates findings from psychology[1] about how humans solve problems and represent knowledge in order to design formalisms that will make complex systems easier to design and build. Also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets.
(Knowledge-Based approach)
Automated Reasoning: The study of automated reasoning helps produce computer programs that allow computers to reason completely, or nearly completely, automatically. Although automated reasoning is considered a sub-field of artificial intelligence, it also has connections with theoretical computer science, and even philosophy.
NLP: Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
You can't say to an "Applied AI" agent: go out and find out what you do on your own.
Example of sub-symbolic information:
Yann LeCun: The phrase "He took his bag and left the room" implies in particular that the person walked out of the room rather than, for instance, jumping out of the window or teleporting to another planet
Other even more remote tasks include algorithmic theories of creativity, curiosity and surprise as pursued by Juergen Schmidhuber. One expects that general AI will be able to solve arbitrary intellectual tasks, which is expected to happen around 2045 according to Ray Kurzweil, a famous entrepreneur and futurologist.
Types are based on agent’s interaction with environment
Program synthesis is the task of automatically constructing a program that satisfies a given high-level specification[1]. In contrast to other automatic programming techniques, the specifications are usually non-algorithmic statements of an appropriate logical calculus.[2] Often, program synthesis employs techniques from formal verification.
For the reinforcement learning example, one may try to learn a value function that describes, for each setting of a chess board, the degree by which White’s position is better than Black’s. Yet the only information available to the learner at training time is positions that occurred throughout actual chess games, labeled by who eventually won that game.
In IRL setting, everything is the same as the direct reinforcement learning, but there is no information on rewards received by the agent upon taking actions. Instead we are simply given a sequence of states of the environment and actions by the agent. And given that we are asked what objective the agent pursued when performing these actions
Demand Forecast: understand and predict customer demand to optimize supply decisions by corporate supply chain and business management.
Machine Translation example: Google Translate
As you can see, NNs are present in all types. By the Universal Approximation Theorem, essentially any (continuous) function can be approximated by a NN.
Regression is the most commonly used algorithm in Finance
Asset management refers to systematic approach to the governance and realization of value from the things that a group or entity is responsible for, over their whole life cycles. It may apply both to tangible assets (physical objects such as buildings or equipment) and to intangible assets (such as human capital, intellectual property, goodwill and/or financial assets).
Quantitative Trading (Algorithmic Trading): Algorithmic trading is a method of executing a large order (too large to fill all at once) using automated pre-programmed trading instructions accounting for variables such as time, price, and volume
Most common uses
The reason that reinforcement learning has applications in perception tasks in finance:
In finance, expectations regarding the future are sometimes embedded in perception of today’s environment. If this future is influenced by actions of rational agents, RL might be an appropriate framework
(Today’s perception affects the future price.)
Rational financial AI agents:
These agents learn to perceive the environment, that is, to digest financial and sometimes non-financial data, and to perform certain actions to maximize some measure of performance
Interpretability is also important in sensitive (life-critical) or moral problems. For more information, see this:
Each pair in the training data S is generated by first sampling a point xi according to D and then labeling it by f.
Domain set is the set of objects that we may wish to label. For example, the set of all papayas.
It is important to note that we do not assume that the learner knows anything about the distribution D. We assume that there is some “correct” labeling function, f : X -> Y, and that yi = f(xi) for all i. This assumption can be relaxed.
= {1,…,m}
The area of the gray square in the picture is 2 and the area of the blue square is 1. Assume that the probability distribution D is such that instances are distributed uniformly within the gray square and the labeling function, f, determines the label to be 1 if the instance is within the inner blue square, and 0 otherwise.
The first component reflects the quality of our prior knowledge
choosing H to be a very rich class decreases the approximation error but at the same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand, choosing H to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting.
Bayesian probability is a special kind of prior knowledge. (prior knowledge about distribution)
Once we make no prior assumptions about the data-generating distribution, no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal one.
Advantages of Representation Learning: better performance, adapting to new tasks with minimal human intervention. Factors: sources of influence, for example: 1) unobserved objects or unobserved forces in the physical world that affect observable quantities; 2) constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data.
Speech recording (speaker’s age, their sex, their accent and the words they speak); car image analysis (position of the car, its color, and the angle and brightness of the sun).
The individual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle
Suppose we have a vision system that can recognize cars, trucks, and birds, and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness from images of cars, trucks and birds, not just from images of one specific category of objects.
Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart.
For example, an AI system observing an image of a face with one eye in shadow may initially see only one eye. After detecting that a face is present, the system can then infer that a second eye is probably present as well. In this case, the graph of concepts includes only two layers (a layer for eyes and a layer for faces), but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times. There is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.”
While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function.
The earliest predecessors of modern deep learning were simple linear models
One should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization.
Larger networks are able to achieve higher accuracy on more complex tasks.
The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
We may also discuss prior beliefs as directly influencing the function itself and influencing the parameters only indirectly, as a result of the relationship between the parameters and the function. Additionally, we informally discuss prior beliefs as being expressed implicitly by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even be possible to express) in terms of a probability distribution representing our degree of belief in various functions.
In other words, if we know a good answer for an input x (for example, if x is a labeled training example), then that answer is probably good in the neighborhood of x.
Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience.
Depth: deep and shallow networks
Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.
We can think of φ as providing a set of features describing x, or as providing a new representation for x.
The 3rd approach can capture the benefit of the first approach by being highly generic: we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well.
The only challenge is to fit the training set. By Occam’s Razor, we start with linear models.
it may be tempting to make f(1) linear as well. Unfortunately, if f(1) were linear, then the feedforward network as a whole would remain a linear function of its input.
Most neural networks establish a nonlinear function using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function.
The bold numbers printed on the plot indicate the value that the learned function must output at each point.
If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. (Universal Approximation Theorem)
If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth word or in the second word of the sentence. Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.
The convolution operation allows a network to share parameters across time but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input. Recurrent networks share parameters in a different way (second dot).
(Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once.
Parameter sharing is a kind of prior knowledge.
the time step index need not literally refer to the passage of time in the real world. Sometimes it refers only to the position in the sequence.
S(t): state of the system (dynamical system)
Each node represents the state at some time t, and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.
By unfolding, we avoid cycles in graph
RNN has input to hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W , and hidden-to-output connections parametrized by a weight matrix V
any function computable by a Turing machine can be computed by such a recurrent network of a finite size
The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations.
A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities.
Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training.
Maximum likelihood thus specifies that during training, rather than feeding the model’s own output back into itself, these connections should be fed with the target values specifying what the correct output should be
Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.