Final Report: CFM Challenge
MVA course: Représentations Parcimonieuses
Khalil Bergaoui, Azza Ben Farhat
{khalil.bergaoui;azza.ben-farhat}@student.ecp.fr
March 23rd 2021
1 Introduction
In the context of the MVA course entitled "Représentations Parcimonieuses", we participated in the CFM challenge: Stock Trading Prediction of Auction Volume. Throughout this report, we detail the methodology that we adopted during the challenge, present the results that we obtained and compare them to CFM's benchmark. Additionally, we discuss some of the difficulties that we encountered in this project and present our final solution as well as potential future directions.
In the next section, we will begin by briefly presenting the goal of the CFM
challenge and reviewing the related work.
2 Related Work
The goal of this year’s CFM challenge is to predict the volume (total value of
stock exchanged) available for auction, for 900 stocks over about 350 days. The
problem is thus formulated as a regression problem.
2.1 Literature Overview
Although the literature on the topic of auction volume prediction is not particularly rich, financial time series analysis as well as regression tasks are widely covered topics in machine learning. Considering the diverse nature of this challenge's input data (a combination of independent values, identifiers and short noisy time series, in particular the return and volume features), the auction volume prediction problem can be tackled from different angles. In this section, we briefly present relevant methods to approach the problem that we have found in the literature.
While there is no particularly straightforward state-of-the-art method for auction volume prediction, the basic strategy is to extend techniques from supervised machine learning for regression tasks. In particular, we have found that better results are achieved with hybrid autoregressive models, as in [5], where the authors perform the prediction in two steps: first, a model is trained to fit the data; then, a second model is trained to fit the difference between the first model's predictions and the ground truth (the residual error). This was in fact the strategy adopted by CFM's benchmark model.
2.2 CFM Benchmark
A hybrid model is used as a benchmark in this challenge. First, a linear regression model is trained to predict the auction volume for a given data sample $x = (x_i)_{1 \le i \le 126}$, using all columns except "pid": $v_{pred} = \beta_0 + \sum_{i=1}^{125} x_i \beta_i$, where $(\beta_i)_i$ are the regression parameters. Then a tree-based ensemble learning algorithm, LightGBM, uses the "pid" information to fit the residual error $\epsilon = v_{pred} - v_{true}$ between the linear model's prediction and the ground truth target.
Note that LightGBM is a gradient boosting framework that learns by grow-
ing trees vertically (leaf-wise) and relies on a large set of hyperparameters that
need to be carefully fine-tuned in order to avoid overfitting.
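For concreteness, a minimal sketch of this two-stage hybrid scheme is given below. The LightGBM settings, the sign convention for the residual and the way "pid" is appended are illustrative assumptions, not the exact benchmark configuration.

```python
import lightgbm as lgb
import numpy as np
from sklearn.linear_model import LinearRegression

# X: feature matrix without "pid", y: (log) auction volume, pid: stock identifiers.
def fit_hybrid(X, pid, y):
    # Stage 1: linear regression on the numerical columns.
    lin = LinearRegression().fit(X, y)
    residual = y - lin.predict(X)

    # Stage 2: LightGBM fits the residual, with access to the stock identifier.
    X_res = np.column_stack([X, pid])
    gbm = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    gbm.fit(X_res, residual, categorical_feature=[X.shape[1]])  # treat pid as categorical
    return lin, gbm

def predict_hybrid(lin, gbm, X, pid):
    # Final prediction = linear prediction + learned residual correction.
    X_res = np.column_stack([X, pid])
    return lin.predict(X) + gbm.predict(X_res)
```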
3 Methodology
In this section, we describe the steps that we carried out during the project and the difficulties that we encountered. The starting point in this data challenge was data exploration and dealing with noisy samples, which took the form of missing values.
3.1 Missing values
The following table summarizes information about missing data:
Rows with missing data in the training set: 37%
Rows with missing data in the test set: 33%
Missing absolute-return values in the train set: 5%
Missing absolute-return values in the test set: 4%
Table 1: Missing data in the train and test datasets.
Remark: we noticed that if the value of the n-th feature of the absolute returns (abs ret) is missing, then the value of the n-th feature of the relative volume (rel vol) is also missing.
The difficulty when dealing with missing values is that we cannot quantitatively anticipate the impact of the adopted strategy on the performance of the learning algorithm, which is why we decided to start by simply replacing missing values with zeros, as in CFM's benchmark model.
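As an illustration, the two imputation strategies we compare later (zeros, as in the benchmark, and per-stock column means) can be written as follows; the column names ("ID", "pid", "date") are assumptions about the data layout.

```python
import pandas as pd

def impute_zeros(df: pd.DataFrame) -> pd.DataFrame:
    # Benchmark-style imputation: every missing value becomes 0.
    return df.fillna(0.0)

def impute_stock_means(df: pd.DataFrame) -> pd.DataFrame:
    # Per-stock imputation: replace a missing value in a column by the mean of
    # that column over all days of the same stock (identified by "pid").
    feature_cols = [c for c in df.columns if c not in ("ID", "pid", "date")]
    out = df.copy()
    out[feature_cols] = df.groupby("pid")[feature_cols].transform(
        lambda s: s.fillna(s.mean())
    )
    return out
```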
3.2 Feature extraction
Once missing values are replaced, we can start extracting relevant features from
the data to use as inputs for the learning algorithms. A successful predictive model should use features that are informative about the output, which in our case is the auction volume for a given stock on a given day. So, in addition to the provided input columns, which represent daily information about a given stock, it might be interesting to exploit the past of the auction volume for a given stock or the interaction between different stocks. To this end, we performed a quick correlation analysis, summarized in the figure below:
Figure 1: (a) Within-stock correlation; (b) Between-stock correlation.
In the left panel, we plot the auto-correlation of the auction volume time series (over the 800 days in the train set) for different lag values; we randomly picked two stocks: stock 360 (blue curve) and stock 850 (orange curve). In the right panel, we display the correlation map of the target (auction volume)
between a random set of 50 different stocks. In both cases, we obtain low
correlation values. In addition, since in the test set we do not always have
access to the auction volume in the preceding days, we decided to focus our
study on the provided input columns only.
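For reference, a minimal version of these two checks, assuming a long-format DataFrame with "pid", "day" and "target" columns, could look like:

```python
import pandas as pd

def target_autocorrelation(df: pd.DataFrame, pid: int, max_lag: int = 30):
    # Auto-correlation of one stock's auction-volume series over the train days.
    series = df[df["pid"] == pid].sort_values("day")["target"]
    return [series.autocorr(lag=k) for k in range(1, max_lag + 1)]

def between_stock_correlation(df: pd.DataFrame, pids):
    # Day-indexed target matrix (one column per stock), then pairwise correlations.
    pivot = df[df["pid"].isin(pids)].pivot(index="day", columns="pid", values="target")
    return pivot.corr()
```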
3.2.1 Principal Component Analysis
Principal Component Analysis (PCA) is a statistical technique generally used for dimensionality reduction in machine learning [7]. Geometrically, it is a projection method: data with m columns (features) is projected onto a subspace spanned by m or fewer uncorrelated, orthogonal directions, called principal components, chosen to retain as much of the variance of the original data set as possible.
The new vectors are ordered such that the retention of variation present in the
original variables decreases as we move down in the order. So, in this way, the
first principal component retains maximum variation that was present in the
original components.
We applied a PCA transform to the normalized train set without keeping
the ”pid”, ”date” and ”ID” features, to look for a possible new set of features
that could be used for training the model. The following table summarizes the
explained variance ratio of the first 5 principal components:
Principal component   Variance ratio
1                     9.99999560e-01
2                     3.91120674e-07
3                     4.83349491e-08
4                     8.88623697e-11
5                     2.14177158e-11
Table 2: Explained variance ratio of the first 5 principal components.
We can see that the first principal component contains almost all of the variance of the original data. This means that replacing the original data set by the first principal component is a good approximation, since it explains almost all of the variance. At the same time, such a result implies that all the columns (except "pid", "date" and "ID") are essentially linearly dependent, which we found a bit surprising. We then computed the correlation between the first principal component and the output we want to predict (the log of the auction volume) and found a relatively low value of 0.18. Our guess is that in the high-dimensional space (126 dimensions), most data points are concentrated along the same direction (the first principal component), which is a biased direction; looking at the distribution of the target (auction volume) values over the training set, displayed in Figure 3, we assume that the bias corresponds to data points with values around −2.
Another possible interpretation is that the observed linear dependency could have been artificially caused by replacing the missing values with non-zero values (in this experiment we replaced each missing value in a given column, for a given stock, with the average computed across that same column over all rows corresponding to different days of the same stock). However, we repeated the same experiment replacing all missing values with zero and obtained a similar result (99% of the variance explained by the first component alone).
In this case, it seems that linear methods are not sufficient to capture the specificity of the data set. One could explore non-linear methods for the dimensionality reduction step, but due to time constraints we were unable to do so.
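The PCA experiment itself reduces to a few lines of scikit-learn; the feature matrix below is an assumption about how the raw columns are arranged (one row per (stock, day) sample, with "pid", "date" and "ID" already dropped).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_report(X: np.ndarray, n_components: int = 5) -> np.ndarray:
    # Normalize the columns, fit a PCA and report the explained variance ratios.
    X_norm = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(X_norm)
    print("explained variance ratios:", pca.explained_variance_ratio_)
    # Projected features (e.g. the first 5 principal components used later).
    return pca.transform(X_norm)
```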
3.2.2 Wavelet Transform
The wavelet transform is a mathematical tool used in signal processing to decompose signals over dilated and translated wavelets. In particular, a wavelet $\psi_{\delta,\sigma}$ is a function parametrized by shift and scale parameters, allowing a given signal to be analyzed at multiple resolutions. Whereas there exist broad categories of wavelet functions, we restrict our application to real wavelets since they are, in contrast with complex wavelets, often used to detect sharp signal transitions [9]. In our case, we apply wavelet transforms to detect sharp transitions in both the return signal and the volume signal (columns abs ret and rel vol, respectively, in the input data), as they are likely to be significant predictive features. In our experiments, we use a real version of the continuous Morlet wavelet, $\psi(t) = e^{-t^2/2}\cos(5t)$, and the continuous wavelet transform is applied to our discrete data (61 discrete samples for each of the return and volume signals) as a convolution with the discretized integral of the wavelet $\psi$. We use the CWT implementation of the PyWavelets package [2] and analyze the signals over 32 scales (larger scales correspond to stretching the wavelet: for example, at scale 10 the wavelet is stretched by a factor of 10, making it sensitive to lower frequencies in the signal).
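A sketch of this transform using PyWavelets is shown below; "morl" is PyWavelets' real Morlet wavelet, and the input here stands for one 61-sample abs ret or rel vol row.

```python
import numpy as np
import pywt

def wavelet_image(signal_61: np.ndarray, n_scales: int = 32) -> np.ndarray:
    # Continuous wavelet transform of one 61-sample intraday series with the
    # real Morlet wavelet, over scales 1..32; returns a (32, 61) "image".
    scales = np.arange(1, n_scales + 1)
    coeffs, _freqs = pywt.cwt(signal_61, scales, "morl")
    return coeffs
```

Stacking the transforms of the return and volume series then gives the 2-channel image fed to the convolutional part of the network described in section 3.3.4.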
Figure 2: Wavelet Transform at 32 scales.
Thus the 1D signal is transformed into 32 one-dimensional signals that can be represented as an image, as shown in the figure above, which illustrates the relevance of wavelet transforms for capturing non-stationary transitions. For instance, around the 15th period of day 23, the volume of stock 739 abruptly increases. This sudden change is reflected in the multi-resolution wavelet transform displayed below the signal.
Therefore, wavelet transforms represent a good candidate set of features that allow us to use computer vision techniques in our predictive task, as we will see in more detail in section 3.3.
3.3 Algorithms
3.3.1 Nearest Neighbors
Using the previously described dimensionality reduction technique, PCA, we can map the input features to a low-dimensional space in which the nearest neighbors algorithm is more appropriate [4]. In this case, given a test data sample (a 126-dimensional vector), we project it onto the low-dimensional space and use its K nearest neighbors among the training points (for the Euclidean distance) to make the prediction, where K is a hyperparameter that needs to be fine-tuned. However, this parameter can be tricky to optimize since it heavily depends on the density of data points in the lower-dimensional space. Additionally, in our case, the distribution of training points is biased towards specific output values, as displayed in the figure below, which could easily lead us to overfit the training data.
Figure 3: Distribution of the target (auction volume) over the training set.
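A minimal version of this KNN-on-PCA predictor is sketched below; the number of components and the value of K are illustrative, not the values we actually tuned.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train: feature matrix (without "pid", "date", "ID"), y_train: log auction volume.
def knn_on_pca(X_train, y_train, n_components=5, k=10):
    # Project onto the first principal components, then regress with K nearest neighbors.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        KNeighborsRegressor(n_neighbors=k, metric="euclidean"),
    )
    return model.fit(X_train, y_train)
```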
3.3.2 Ensemble learning : Random Forest
Random forests [3] are an ensemble learning method for classification and regres-
sion tasks that operate by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or
mean/average prediction (regression) of the individual trees.
It applies the general technique of bootstrap aggregating, or bagging, to tree learners: each tree is trained on a random sample of the training set drawn with replacement. The difference between bagged trees and random forests is that random forests use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This helps decrease the correlation between the different learners. Once the trees are trained, predictions are made by averaging the predictions of all the individual regression trees.
Random forests generalize better than single decision trees because they are more robust to noise and overfit less.
3.3.3 Ensemble methods : Stacking
Stacking, or Stacked Generalization, is an ensemble machine learning algorithm that uses a meta-learning algorithm to learn how to best combine the predictions of two or more base machine learning algorithms. It generally gives good results when it combines models with different learning rules, and it typically outperforms each of the base models taken individually.
Its architecture is divided into two parts: the base models, or level-0 models, which are fit on the training data and used to make predictions, and the meta-model, or level-1 model, which learns how to best combine the predictions of the base models. Once the base models are trained, they are fed unseen training data, and their predictions are paired with the ground truth and fed to the meta-model.
It is preferable to use base models that learn in different ways, so that the errors in their predictions are uncorrelated or only weakly correlated. As for the meta-model, it is often a simple model, providing a smooth interpretation of the predictions made by the base models.
We applied stacking using KNN and Random Forest models as base models, and Linear Regression as the meta-model. The prediction results obtained when training the models on the first 5 principal components are presented in the final table in section 3.4.
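A sketch of this stacked model with scikit-learn is given below; the exact hyperparameters (K, number of trees, cross-validation folds) are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_stacking(n_components=5):
    # Base models: KNN and Random Forest; meta-model: linear regression.
    # Inputs are the first principal components of the normalized features.
    base = [
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("rf", RandomForestRegressor(n_estimators=200, n_jobs=-1)),
    ]
    stack = StackingRegressor(estimators=base, final_estimator=LinearRegression(), cv=5)
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), stack)
```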
3.3.4 Neural Networks
In contrast with the previously described learning algorithms, neural networks
lie within the framework of representation learning, in the sense that they do
not always require carefully designed features since extracting discriminative
features from the input data is part of the learning process. However, the
difficulty lies in tuning both the architectural hyperparameters (layer type, depth, hidden units, activation functions, ...) and the training parameters (batch size, optimizer, ...). In fact, as shown in [11], the optimal hyperparameters depend on the given dataset and vary significantly from one training task to another.
In our case, we adapt common deep learning techniques to CFM’s dataset:
- We apply computer vision methods: instead of RGB images, we use a 2-channel image obtained by stacking the wavelet transforms computed over the input data, as described in section 3.2.2.
- We use a stock embedding: this allows us to replace the stock identifier 'pid' with a trainable real-valued vector of fixed size d (a hyperparameter). Thus the stocks are represented by a trainable matrix $W \in \mathbb{R}^{S \times d}$, where S = 900 is the number of distinct stocks in the dataset. Note that the embedding is relevant in our case because the same stocks appear in the training and test sets.
- We use non-linear activation functions to capture non-linearity in the data. Later in this section, we discuss the impact of the choice of non-linearity.
- We use batch normalization [10]: this allows us, among other things, to normalize scalar features that are not necessarily of the same order of magnitude (e.g. the day and LS columns).
With this in mind, we experiment with different neural network configurations and use the mean squared error (MSE) over the validation set as the decision criterion. The figure below details the architecture of the network that achieved the lowest validation MSE in our trials:
Figure 4: Prediction Network architecture.
At each training iteration, the data flows through the network from left to right, the final layer acting as a linear regression model that maps the 99-dimensional feature vector φ(x) to the predicted auction volume. The prediction error is then backpropagated and the parameters of the network are updated by gradient descent. We use the Adam optimizer with a learning rate of 0.0001, a batch of 96 data samples at each iteration, and train the network for a few epochs (≈ 7) until the validation MSE stops decreasing. Note that training takes roughly 8 minutes per epoch on a standard Colab GPU; this could be reduced because, in our pipeline, we compute the wavelet transform on the CPU.
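Since Figure 4 cannot be reproduced here, the sketch below gives the general shape such a network could take in PyTorch; the layer sizes, the number of scalar columns (n_scalars) and the way the branches are merged into the 99-dimensional vector φ(x) are assumptions, not a readout of the exact architecture shown in the figure.

```python
import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    # 2-channel wavelet "image" (abs ret and rel vol CWTs) + stock embedding + scalar columns.
    def __init__(self, n_stocks=900, emb_dim=16, n_scalars=4, feat_dim=99):
        super().__init__()
        self.embed = nn.Embedding(n_stocks, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.LeakyReLU(0.02),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.LeakyReLU(0.02),
            nn.AdaptiveAvgPool2d(1),
        )
        self.scalars = nn.Sequential(
            nn.BatchNorm1d(n_scalars), nn.Linear(n_scalars, 16), nn.LeakyReLU(0.02)
        )
        self.head = nn.Sequential(nn.Linear(32 + emb_dim + 16, feat_dim), nn.LeakyReLU(0.02))
        self.out = nn.Linear(feat_dim, 1)  # final linear-regression layer

    def forward(self, wavelet_img, pid, scalar_feats):
        conv_feat = self.conv(wavelet_img).flatten(1)  # (B, 32)
        phi = self.head(
            torch.cat([conv_feat, self.embed(pid), self.scalars(scalar_feats)], dim=1)
        )
        return self.out(phi).squeeze(1), phi  # prediction and φ(x)
```

Training then follows the description above: Adam with a learning rate of 1e-4, batches of 96 samples, and early stopping on the validation MSE.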
After fitting the network, we adopt the hybrid model strategy and train a
second neural network to ”correct” the first network’s predictions. In this step,
the input corresponds to the concatenation of the prediction network’s learned
representation φ(x) and the predicted value. This 100-dimensional vector is
used to train the correction network displayed in the figure below:
Figure 5: Correction Network architecture.
Similarly, we train this network using the Adam optimizer with a lower learning rate of $10^{-5}$ and a batch of 32 data samples at each iteration. To avoid overfitting, we train the network to minimize a regularized MSE loss of the form $\mathrm{MSE}(y, y_{pred}) + \lambda \|w\|^2$, where $\lambda$ is the regularization coefficient, fixed at 0.01, and $w$ are the weights of the final fully connected layer of the network.
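In PyTorch terms, and assuming final_layer denotes the correction network's last fully connected layer, this objective is simply:

```python
import torch.nn.functional as F

def regularized_mse(pred, target, final_layer, lam=0.01):
    # MSE(y, y_pred) + λ‖w‖² on the weights of the final fully connected layer.
    return F.mse_loss(pred, target) + lam * final_layer.weight.pow(2).sum()
```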
The combination of the prediction network and the correction network, as
a hybrid model, allows us to achieve better performance than CFM’s proposed
benchmark model as reported in section 3.4.
Choice of the activation function:
As is common in convolutional neural networks, we initially implemented our network with the ReLU (Rectified Linear Unit) activation function, $\mathrm{ReLU}(x) = \max(0, x)$. However, we were unable to reach better performance than the benchmark with architectural changes only. As it turned out, the ReLU was the source of a saturation at the output level: our model was unable to predict values larger than a threshold of ≈ −1.62, which corresponds to the network's output bias (i.e. the value predicted when the final feature vector is the null vector). To fix this issue, we replaced the ReLU with its leaky version with parameter 0.02, $\mathrm{LeakyReLU}(x) = 0.02x$ if $x < 0$ and $x$ if $x \ge 0$, which allows negative values to flow through the network's layers. As illustrated in the figure below, this change in the activation function solves the issue.
(a) Output distribution with ReLU; (b) Output distribution with LeakyReLU (0.02).
Note that this effect might be caused by not scaling the wavelet transform arrays to values in [0, 1] before feeding them to the convolutional network, as would be done for RGB images.
In addition to this choice of activation function, we made use of residual connections [8] in some layers of our network in order to improve the flow of information.
3.4 Results
In this section, we report the performance of the algorithms presented in section 3 and compare them to CFM's benchmark. We have found that ensemble learning through stacking gives better results than the base models taken individually. However, the test performance on the challenge platform is 0.706; the gap between the validation and test sets shows that our stacking model overfitted the data. We assume that this is mainly due to the difficulty of optimally fine-tuning its hyperparameters. We have also found that a single prediction neural network, as described in section 3, achieves good results but fails to outperform CFM's benchmark; in particular, a network trained on wavelet transforms performs better and overfits less (smaller gap between validation MSE and public MSE) than the same network trained without them. However, when used as part of a hybrid model (prediction network plus correction network), the neural network approach achieves a lower mean squared error over the private set than the benchmark.
We summarize the numerical results obtained on the public and private sets in the table below:
Model                            Validation MSE   Public MSE   Private MSE
KNN (Nearest Neighbors)          0.5041           -            -
Random Forest                    0.5242           -            -
Stacking (KNN + Random Forest)   0.4605           0.7068       -
NN (ReLU, without WT)            0.4103           0.5435       -
NN (ReLU + WT)                   0.4612           0.5272       -
NN (LeakyReLU + WT)              0.4389           0.4963       -
CFM benchmark                    -                0.4742       0.4735
Hybrid NN                        0.4072           0.4677       0.4650
Table 3: Performance of the trained models.
4 Discussion
In this report, we have described the approach we adopted during the challenge and the different methods we implemented. We managed to improve the performance of our predictive model and achieve slightly better results than CFM's benchmark by adopting the same hybrid predictive model strategy while changing the preprocessing steps as well as the learning algorithm. In particular, we have found that neural networks can achieve relatively good results with a suitable architecture, e.g. the architectures proposed in Figures 4 and 5. However, we must point out that our final result does not represent the optimal solution within the pipeline we have described. In fact, we did not study the impact of every possible hyperparameter, since this is a tedious task given the large number of possible combinations. This is why we believe that our final results can be further improved by changing certain parameters such as the stock embedding dimension, the size of the convolutional and max-pooling operations, the activation function, etc.
Additionally, we believe that further improvements can be achieved by focusing on the preprocessing step and trying to extract even more relevant features from the raw data. One particularly interesting method we have found in the literature can be applied to the return and volume columns by viewing them as time series and applying Topological Data Analysis (TDA) tools, as in [6], to extract significant topological features that would help with the prediction. In fact, we have experimented with this method using the open-source Python package gudhi [1]. We adopted the approach described in [6] in order to extract the L1 norm of the first persistence landscape of the point cloud represented by the return and volume time series. We have found that this topological feature correlates relatively well with the target (auction volume): we computed it for only 50 rows in the training set (50 different stocks on the same day) and found a correlation value of 0.426 between the L1 norm and the auction volume. However, this method is computationally heavy, especially for a training set as large as ours. Mainly for this reason, we were unable to test the relevance of this method further, but we think that it is somewhat promising.
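As a rough sketch, and assuming each row's 61-sample return and volume series are paired into a 2D point cloud, the gudhi-based feature we experimented with looks like the following; the Rips parameters and landscape resolution are illustrative:

```python
import numpy as np
import gudhi
from gudhi.representations import Landscape

def landscape_l1(abs_ret: np.ndarray, rel_vol: np.ndarray) -> float:
    # Point cloud from the paired return/volume series of one (stock, day) row.
    points = np.column_stack([abs_ret, rel_vol])
    rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()
    diag = st.persistence_intervals_in_dimension(1)  # H1 persistence pairs
    if len(diag) == 0:
        return 0.0
    # First persistence landscape, summarized by its L1 norm.
    landscape = Landscape(num_landscapes=1, resolution=100).fit_transform([diag])[0]
    return float(np.abs(landscape).sum())
```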
References
[1] GUDHI library for TDA: tutorials. https://github.com/GUDHI/TDA-tutorial.
[2] PyWavelets: Continuous Wavelet Transform (CWT) documentation. https://pywavelets.readthedocs.io/en/latest/ref/cwt.html.
[3] Leo Breiman. Random forests. Machine Learning, Springer, 2001.
[4] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer, 2001.
[5] Daniel Libman, Simi Haber, and Mary Schaps. Volume prediction with neural networks. 2019.
[6] Marian Gidea and Yuri Katz. Topological data analysis for financial time series: Landscapes of crashes. arXiv:1703.04385, 2017.
[7] Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. https://doi.org/10.1098/rsta.2015.0202, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385 [cs.CV], 2015.
[9] Stéphane Mallat. A Wavelet Tour of Signal Processing, 3rd edition. https://www.di.ens.fr/~mallat/College/WaveletTourChap4.2.pdf.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167v3 [cs.LG], 2015.
[11] Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud. A comprehensive analysis of deep regression. arXiv:1803.08450v3 [cs.CV], 2018.
12

More Related Content

What's hot

GPUFish_technical_report
GPUFish_technical_reportGPUFish_technical_report
GPUFish_technical_reportCharles Hubbard
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingSubhajit Sahu
 
Graphical Models In Python | Edureka
Graphical Models In Python | EdurekaGraphical Models In Python | Edureka
Graphical Models In Python | EdurekaEdureka!
 
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABFAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABJournal For Research
 
Accurate Learning of Graph Representations with Graph Multiset Pooling
Accurate Learning of Graph Representations with Graph Multiset PoolingAccurate Learning of Graph Representations with Graph Multiset Pooling
Accurate Learning of Graph Representations with Graph Multiset PoolingMLAI2
 
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...MLAI2
 
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMGRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMIJCSEA Journal
 
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...ijmnct
 
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...ijmnct
 
A genetic algorithm to solve the
A genetic algorithm to solve theA genetic algorithm to solve the
A genetic algorithm to solve theIJCNCJournal
 
Flavours of Physics Challenge: Transfer Learning approach
Flavours of Physics Challenge: Transfer Learning approachFlavours of Physics Challenge: Transfer Learning approach
Flavours of Physics Challenge: Transfer Learning approachAlexander Rakhlin
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Sangamesh Ragate
 
Image similarity using symbolic representation and its variations
Image similarity using symbolic representation and its variationsImage similarity using symbolic representation and its variations
Image similarity using symbolic representation and its variationssipij
 
Modified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformModified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformIJERA Editor
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...Kyong-Ha Lee
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
DSP IEEE paper
DSP IEEE paperDSP IEEE paper
DSP IEEE paperprreiya
 
Icitam2019 2020 book_chapter
Icitam2019 2020 book_chapterIcitam2019 2020 book_chapter
Icitam2019 2020 book_chapterBan Bang
 

What's hot (20)

GPUFish_technical_report
GPUFish_technical_reportGPUFish_technical_report
GPUFish_technical_report
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
 
Graphical Models In Python | Edureka
Graphical Models In Python | EdurekaGraphical Models In Python | Edureka
Graphical Models In Python | Edureka
 
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABFAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
 
Accurate Learning of Graph Representations with Graph Multiset Pooling
Accurate Learning of Graph Representations with Graph Multiset PoolingAccurate Learning of Graph Representations with Graph Multiset Pooling
Accurate Learning of Graph Representations with Graph Multiset Pooling
 
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
 
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMGRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
 
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
 
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
SOLVING OPTIMAL COMPONENTS ASSIGNMENT PROBLEM FOR A MULTISTATE NETWORK USING ...
 
A genetic algorithm to solve the
A genetic algorithm to solve theA genetic algorithm to solve the
A genetic algorithm to solve the
 
Flavours of Physics Challenge: Transfer Learning approach
Flavours of Physics Challenge: Transfer Learning approachFlavours of Physics Challenge: Transfer Learning approach
Flavours of Physics Challenge: Transfer Learning approach
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)
 
Image similarity using symbolic representation and its variations
Image similarity using symbolic representation and its variationsImage similarity using symbolic representation and its variations
Image similarity using symbolic representation and its variations
 
Modified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformModified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transform
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Ak04605259264
Ak04605259264Ak04605259264
Ak04605259264
 
DSP IEEE paper
DSP IEEE paperDSP IEEE paper
DSP IEEE paper
 
Icitam2019 2020 book_chapter
Icitam2019 2020 book_chapterIcitam2019 2020 book_chapter
Icitam2019 2020 book_chapter
 

Similar to CFM Challenge - Course Project

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
A Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisA Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisMonica Franklin
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONijaia
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_JieMDO_Lab
 
An Empirical Investigation Of The Arbitrage Pricing Theory
An Empirical Investigation Of The Arbitrage Pricing TheoryAn Empirical Investigation Of The Arbitrage Pricing Theory
An Empirical Investigation Of The Arbitrage Pricing TheoryAkhil Goyal
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetIJCERT
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...Amir Ziai
 
Static Analysis of Computer programs
Static Analysis of Computer programs Static Analysis of Computer programs
Static Analysis of Computer programs Arvind Devaraj
 
SBSI optimization tutorial
SBSI optimization tutorialSBSI optimization tutorial
SBSI optimization tutorialRichard Adams
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleSajith Edirisinghe
 
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challangeTharindu Ranasinghe
 

Similar to CFM Challenge - Course Project (20)

Chapter 18,19
Chapter 18,19Chapter 18,19
Chapter 18,19
 
working with python
working with pythonworking with python
working with python
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
FSRM 582 Project
FSRM 582 ProjectFSRM 582 Project
FSRM 582 Project
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
A Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisA Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer Analysis
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
 
icpr_2012
icpr_2012icpr_2012
icpr_2012
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_Jie
 
An Empirical Investigation Of The Arbitrage Pricing Theory
An Empirical Investigation Of The Arbitrage Pricing TheoryAn Empirical Investigation Of The Arbitrage Pricing Theory
An Empirical Investigation Of The Arbitrage Pricing Theory
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
report
reportreport
report
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
 
Topic 1.4
Topic 1.4Topic 1.4
Topic 1.4
 
Static Analysis of Computer programs
Static Analysis of Computer programs Static Analysis of Computer programs
Static Analysis of Computer programs
 
SBSI optimization tutorial
SBSI optimization tutorialSBSI optimization tutorial
SBSI optimization tutorial
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - Kaggle
 
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challange
 

Recently uploaded

Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555kikilily0909
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxdharshini369nike
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsHajira Mahmood
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2John Carlo Rollon
 
Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10ROLANARIBATO3
 

Recently uploaded (20)

Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2
 
Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10
 

CFM Challenge - Course Project

  • 1. Final Report : CFM Challenge MVA course : Représentations Parcimonieuses Khalil Bergaoui, Azza Ben Farhat {khalil.bergaoui;azza.ben-farhat}@student.ecp.fr March 23rd 2021 1 Introduction In the context of the MVA course entitled ”Représentations Parcimonieuses”, we participated in the CFM challenge : Stock Trading Prediction of Auction Vol- ume. Throughout this report, we will detail the methodology that we adopted during the challenge, we will present the results that we obtained and compare them to CFM’s benchmark. Additionnally, we will discuss some of the difficul- ties that we encountered in this project and discuss our final solution as well as potential future directions. In the next section, we will begin by briefly presenting the goal of the CFM challenge and reviewing the related work. 2 Related Work The goal of this year’s CFM challenge is to predict the volume (total value of stock exchanged) available for auction, for 900 stocks over about 350 days. The problem is thus formulated as a regression problem. 2.1 Litterature Overview Although the litterature on the topic of auction volume prediction is not partic- ularly rich, financial time series analysis as well as regression tasks are widely covered topics in machine learning. Considering the diverse nature of this chal- lenge’s input data (combination of independent values, identifiers and short noisy time series, in particular the return and volume features), the auction volume prediction problem can be tackled from different angles. In this section, we will briefly present relevant methods to approach the problem that we have found in the litterature. While there is no particularly straightforward state-of-the-art method for auction volume prediction, the basic strategy is to extend the techniques used for supervised machine learning for regression tasks. In particular we have found 1
  • 2. that better results are achieved with hybrid autoregressive models as in [5] where the authors perform the prediction in two steps : first a model is trained to fit the data, then a second model in trained to fit the difference between the first model’s predictions and the ground truth data (the residual error). This was in fact the strategy adopted by CFM’s benchmark model. 2.2 CFM Benchmark A hybrid model is used as a benchmark in this challenge. In fact, a linear regression model is trained to predict the auction volume, for a given data sample x = (xi)1≤i≤126, using all columns except ”pid” : vpred = β0 + P125 i=1 xiβi where (βi)i are the regression parameters. Then a tree based ensemble learning algorithm, LightGBM, includes the ’pid’ information to fit the residual error = vpred −vtrue between the ground truth target and the linear model’s prediction. Note that LightGBM is a gradient boosting framework that learns by grow- ing trees vertically (leaf-wise) and relies on a large set of hyperparameters that need to be carefully fine-tuned in order to avoid overfitting. 3 Methodology In this section, we will describe the steps that we carried during the project and will describe the difficulties that we encountered. The starting point in this data challenge was data exploration and dealing with noisy samples that took the form of missing values. 3.1 Missing values The following table summarizes information about missing data: Rows with missing data in training set (%) 37% Rows with missing data in test set (%) 33% Missing values of absolute returns in train set (%) 5% Missing values of absolute returns in test set (%) 4% Table 1: Missing data in the train and test datasets. Remark : We noticed that if the value of the nth feature of the absolute returns abs retn is missing, then the value of the nth feature of relative volume rel voln is also missing. The difficulty when dealing with missing values is that we are unable to quan- titatively anticipate the impact of the adopted strategy on the performance of the learning algorithm, which is why we decided to start with simply remplacing missing values with zeros as it was the case in CFM’s benchmark model. 2
  • 3. 3.2 Feature extraction Once missing values are replaced, we can start extracting relevant features from the data to use them as inputs for learning algorithms. In fact, a successful predictive model would use informative features about the output, which in our case is the auction volume for a given stock at a given day. So in addition to the provided input columns, which represent daily information about a given stock, it might be interesting to exploit the past of the auction volume for a given stock or the interaction between different stocks. For this we performed a quick correlation analysis that can be summerized in the below figures: (a) Within-stock correlation (b) Between-stock correlation On the left figure, we plot the auto-correlation of the auction volume time series (over the 800 days in the train set) using different lag values. We randomly picked two stocks: stock 360 (blue curve) and stock 850 (orange curve). On the right figure, we display the correlation map of the target (auction volume) between a random set of 50 different stocks. In both cases, we obtain low correlation values. In addition, since in the test set we do not always have access to the auction volume in the preceding days, we decided to focus our study on the provided input columns only. 3.2.1 Principal Component Analysis Principal Component Analysis (PCA) is a statistical technique generally used for dimensionality reduction in machine learning[7]. Geometrically, it corresponds to a projection method where data with m-columns (features) is projected into a subspace with m or fewer columns, which are uncorrelated and orthogonal, called principal components, whilst retaining the essence of the original data by maximizing the variability of the data set that is contained in the new vectors. The new vectors are ordered such that the retention of variation present in the original variables decreases as we move down in the order. So, in this way, the first principal component retains maximum variation that was present in the original components. We applied a PCA transform to the normalized train set without keeping 3
  • 4. the ”pid”, ”date” and ”ID” features, to look for a possible new set of features that could be used for training the model. The following table summarizes the explained variance ratio of the first 5 principal components: Principal component Variance ratio 1 9.99999560e-01 2 3.91120674e-07 3 4.83349491e-08 4 8.88623697e-11 5 2.14177158e-11, Table 2: Explained variance ratio of the 5 first principal components We can see that the first principle component contains almost all of the variance of the original data. This means that replacing the original data set by the first principle component is a good approximation, since it explains almost all the variance of the original data. At the same time, such result implies that all the columns (”pid”, ”date” and ”ID” excepted) are linearly dependent which we found a bit surprising. Then, we computed the correlation between the first principal componant and the output we want to predict (log of the auction volume) and we have found a relatively low value of 0.18. Our guess is that in the high dimensional space (126-dimensions), most data points are concentrated along the same direction (first principal component) which is a biased direction and we assume, by looking at the distribution of the target (auction volume) values over the training set, displayed in Figure 3, that the bias corresponds to data points corresponding to values around −2. We would like to add that another possible interpretation is that the observed linear dependency could have been ”fakely” caused by replacing the missing values with non zero values (in this experiment we replaced each missing value in a given column, for a given stock, with the average computed accross that same column, for all rows corresponding to different days of the same stock). However we repeated the same experiment when replacing with all missing values with zero and obtained a similar result (99% variance explained by first component only). In this case, it seems that linear methods are not sufficient to capture the specificity of the data set. One could possibly explore non linear methods for the dimensionality reduction step, but for time constraints we were unable to do it. 3.2.2 Wavelet Transform The wavelet transform is a mathematical tool used in signal processing in or- der to decompose signals over dilated and translated wavelets. In particular, a wavelet ψδ,σ is a function parametrized with shift and scale parameters allow- ing to analyze a given signal at multiple resolutions.Whereas there exist broad 4
  • 5. categories of wavelet functions, we will restrict our application to real wavelets since they are, in contrast with complex wavelets, often used to detect sharp signal transitions[9]. In our case, we apply wavelet transforms to detect sharp transitions in both the return signal and the volume signal (columns abs ret and rel vol respectively in the input data) as they are likely to be significant predictive features. In our experiments, we use a real version of the continuous Morlet wavelet given by : ψ(t) = e− t2 2 cos(5t) such that the continuous wavelet transform is applied to our discrete data (61 discrete samples for each of the return and volume signals) as a convolution with the discretized integral of the wavelet ψ. We use the the CWT implementation of the PyWavelet package [2] and analyze the signals over a number of 32 scale ( In other words Larger scales correspond to stretching of the wavelet. For example, at scale=10 the wavelet is stretched by a factor of 10, making it sensitive to lower frequencies in the signal). Figure 2: Wavelet Transform at 32 scales. Thus the 1D signal is transformed to 32 one-dimensional signals that can be represented as an image as we can see from the above figure where we illustrate the relevance of the wavelet transforms in capturing non-stationary transitions. For instance, around the 15th period of day 23, the volume of stock 739 ubruptly increases. This sudden change is reflected in the multiple resolution wavelet transform displayed below the signal. Therefore, wavelet transforms represent a good candidate set of features that allow us to use computer vision techniques in our predictive task as we will see in more details in section 3.3 5
3.3 Algorithms

3.3.1 Nearest Neighbors

Using the previously described dimensionality reduction technique, PCA, we are able to map the input features to a low-dimensional space in which the application of the nearest neighbors algorithm is more appropriate [4]. In this case, given a test data sample (a 126-dimensional vector), we project it onto the low-dimensional space and use its K nearest neighbors among the training points (with respect to the Euclidean distance) to make the prediction, where K is a hyperparameter that needs to be fine-tuned. However, this parameter can be tricky to optimize since it heavily depends on the density distribution of data points in the lower-dimensional space. Additionally, in our case, the density distribution of training points is biased towards specific output values, as displayed in the figure below, which could easily lead us to overfit the training data.

Figure 3: Distribution of the target (auction volume) over the training set.

3.3.2 Ensemble learning: Random Forest

Random forests [3] are an ensemble learning method for classification and regression tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees. It applies the general technique of bootstrap aggregating, or bagging, to tree learners: each tree is fit on a random sample of the training set drawn with replacement. The difference between bagging trees and random forests is that random forests use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This helps decrease the correlation between the different learners. Once the trees are trained, predictions are made by averaging the predictions of all the individual regression trees. Random forests tend to be more effective than single decision trees because they are more robust to noise and overfit less.
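As a minimal sketch, under the same assumptions as the PCA snippet above (Z holds the projections on the first principal components and y the log auction volume; K and the forest size are illustrative values), the two base models could be trained as follows.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hold out part of the training data to estimate the validation MSE.
Z_tr, Z_val, y_tr, y_val = train_test_split(Z, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=50).fit(Z_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z_tr, y_tr)

print(mean_squared_error(y_val, knn.predict(Z_val)))
print(mean_squared_error(y_val, rf.predict(Z_val)))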
3.3.3 Ensemble methods: Stacking

Stacking, or stacked generalization, is an ensemble machine learning algorithm that uses a meta-learning algorithm to learn how to best combine the predictions of two or more base machine learning algorithms. It generally gives good results when it combines models that have different learning rules, and it often outperforms the base models taken individually. Its architecture is divided into two parts: the base models, or level-0 models, which are fit on the training data and used to make predictions, and the meta-model, or level-1 model, which learns how to best combine the predictions of the base models. Once the base models are trained, they are fed with data unseen during their training, and their predictions, paired with the ground truth, are used to train the meta-model.

It is preferable to use base models that learn in different ways, so that the errors in their predictions are uncorrelated or have a low correlation. As for the meta-model, it is often a simple model, providing a smooth interpretation of the predictions made by the base models. We applied stacking using KNN and Random Forest models as base models and Linear Regression as the meta-model. The prediction results obtained when training the models on the first 5 principal components are presented in the final table in Section 3.4.
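A minimal sketch of this setup with scikit-learn's StackingRegressor is shown below; the hyperparameter values are illustrative, and the internal cross-validation plays the role of the "unseen data" fed to the base models.

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Base models with different learning rules, combined by a linear meta-model.
stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=50)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,   # out-of-fold predictions of the base models train the meta-model
)
stack.fit(Z_tr, y_tr)
print(stack.score(Z_val, y_val))   # R^2 on the held-out split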
3.3.4 Neural Networks

In contrast with the previously described learning algorithms, neural networks lie within the framework of representation learning, in the sense that they do not always require carefully designed features, since extracting discriminative features from the input data is part of the learning process. However, the difficulty lies in the tuning of both the architectural hyperparameters (layer type, depth, hidden units, activation functions...) and the training parameters (batch size, optimizer...). In fact, as shown in [11], the optimal hyperparameters depend on the given dataset and vary significantly from one training task to another. In our case, we adapt common deep learning techniques to CFM's dataset:

- We apply computer vision methods: instead of RGB images, we use a 2-channel image obtained by stacking the wavelet transforms computed over the input data, as described in Section 3.2.2.
- We use a stock embedding: this allows us to replace the stock identifier ’pid’ with a trainable real-valued vector of fixed size d (a hyperparameter). The stocks are thus represented by a trainable matrix W ∈ R^(S×d), where S = 900 is the number of distinct stocks in the dataset. Note that the embedding is relevant in our case because the same stocks appear in both the training and test sets.
- We use non-linear activation functions to capture the non-linearity in the data. Further in this section, we discuss the impact of the choice of the non-linearity.
- We use batch normalization [10]: this allows us, among other things, to normalize scalar features which are not necessarily of the same order of magnitude (e.g. the day and LS columns).

With these ingredients, we experiment with different neural network configurations and use the mean squared error over the validation set as the decision criterion. In the figure below, we detail the architecture of the network that achieved the lowest validation MSE in our trials:

Figure 4: Prediction Network architecture.

At each training iteration, the data flows through the network from left to right, the final layer acting as a linear regression model that transforms the 99-dimensional feature vector φ(x) into the predicted auction volume. The prediction error is then backpropagated, and gradient descent updates the parameters of the network. We use the Adam optimizer with a learning rate lr = 0.0001, a batch of 96 data samples at each iteration, and train the network for a few epochs (≈ 7), until the validation MSE stops decreasing. Note that training takes roughly 8 minutes per epoch on a mainstream Colab GPU; this could be optimized since, in our pipeline, the wavelet transform is computed on the CPU.
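Since Figure 4 cannot be reproduced here, the sketch below only illustrates the building blocks listed above (2-channel wavelet image, stock embedding, batch normalization on the scalar features, linear output layer); the layer sizes, the number of scalar features, and the single convolutional block are illustrative assumptions, not our actual architecture.

import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    # Illustrative sizes; the network we actually trained is the one in Figure 4.
    def __init__(self, n_stocks=900, emb_dim=16, n_scalars=4):
        super().__init__()
        self.embed = nn.Embedding(n_stocks, emb_dim)      # trainable stock embedding
        self.conv = nn.Sequential(                        # branch for the (2, 32, 61) wavelet image
            nn.Conv2d(2, 8, kernel_size=3, padding=1),
            nn.LeakyReLU(0.02),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.scalar_bn = nn.BatchNorm1d(n_scalars)        # normalize the scalar features
        conv_out = 8 * 16 * 30                            # flattened size for a (2, 32, 61) input
        self.head = nn.Linear(conv_out + emb_dim + n_scalars, 1)

    def forward(self, wav_img, pid, scalars):
        feats = torch.cat(
            [self.conv(wav_img), self.embed(pid), self.scalar_bn(scalars)], dim=1
        )
        return self.head(feats).squeeze(1)

# Training configuration matching the text: Adam, lr = 1e-4, batches of 96, MSE loss.
model = PredictionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()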
After fitting this network, we adopt the hybrid model strategy and train a second neural network to ”correct” the first network’s predictions. In this step, the input corresponds to the concatenation of the prediction network’s learned representation φ(x) and the predicted value. This 100-dimensional vector is used to train the correction network displayed in the figure below:

Figure 5: Correction Network architecture.

Similarly, we train this network using the Adam optimizer, with a lower learning rate lr = 10^(−5) and a batch of 32 data samples at each iteration. To avoid overfitting, we train the network to minimize a regularized MSE loss of the form MSE(y, y_pred) + λ||w||², where λ is the regularization coefficient, fixed at 0.01, and w denotes the weights of the final fully connected layer of the network. The combination of the prediction network and the correction network, as a hybrid model, allows us to achieve better performance than CFM’s proposed benchmark model, as reported in Section 3.4.

Choice of the activation function:

As is common in convolutional neural networks, we first implemented our network using the ReLU (Rectified Linear Unit) as the non-linear activation function: ReLU(x) = max(0, x). However, we were unable to reach better performance than the benchmark with architectural changes only. As it turned out, the ReLU function was the source of a saturation at the output level: our model was unable to predict values larger than a threshold of ≈ −1.62, which corresponds to the network’s output bias (i.e. the value predicted when the final feature vector is the null vector). To fix this issue, we replaced the ReLU with its leaky version with parameter 0.02, LeakyReLU(x) = 0.02x if x < 0 and x if x ≥ 0, which allows negative values to flow through the network’s layers.
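Combining the penalized objective above with the LeakyReLU activation, a minimal sketch of the correction step could look as follows; the hidden layer size is an illustrative assumption and does not reflect the architecture of Figure 5.

import torch
import torch.nn as nn

# Hypothetical correction network; the real architecture is the one in Figure 5.
correction = nn.Sequential(
    nn.Linear(100, 64),      # input: phi(x) (99 dims) concatenated with the predicted volume
    nn.LeakyReLU(0.02),      # leaky slope 0.02, as discussed above
    nn.Linear(64, 1),
)
final_fc = correction[-1]    # the layer whose weights are penalized
optimizer = torch.optim.Adam(correction.parameters(), lr=1e-5)
mse = nn.MSELoss()
lam = 0.01                   # regularization coefficient from the text

def regularized_loss(x, y):
    y_pred = correction(x).squeeze(1)
    return mse(y_pred, y) + lam * final_fc.weight.pow(2).sum()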
As illustrated in the figure below, this change in the activation function solves the issue.

(a) Output distribution with ReLU. (b) Output distribution with LeakyReLU (0.02).

Note that this effect might be caused by not scaling the wavelet transform arrays to values in [0, 1] before feeding them to the convolutional network, as would have been done with RGB images. In addition to this choice of activation function, we made use of residual connections [8] in some layers of our network in order to enhance the flow of data.

3.4 Results

In this section, we report the performance of the algorithms presented in Section 3 and compare them to CFM’s benchmark. We have found that ensemble learning through stacking gives better results than the base models taken individually. However, the test performance on the challenge platform is 0.706; the gap between the validation and test sets shows that our stacking model overfitted the data. We assume that this is mainly due to the difficulty of optimally fine-tuning its hyperparameters. We have also found that training a prediction neural network, as described in Section 3, achieves good results but fails to perform better than CFM’s benchmark; in particular, a neural network trained on wavelet transforms performs better and overfits less (smaller gap between validation MSE and public MSE) than the same network trained without wavelet transforms. Additionally, when used as part of a hybrid model (prediction network plus correction network), the neural network approach achieves a lower mean squared error over the private set than the benchmark. We summarize the numerical results obtained on the public and private sets in the table below:
Model                             Validation MSE   Public MSE   Private MSE
KNN (Nearest Neighbors)           0.5041           -            -
Random Forest                     0.5242           -            -
Stacking (KNN + Random Forest)    0.4605           0.7068       -
NN (ReLU without WT)              0.4103           0.5435       -
NN (ReLU + WT)                    0.4612           0.5272       -
NN (LeakyReLU + WT)               0.4389           0.4963       -
CFM benchmark                     -                0.4742       0.4735
Hybrid NN                         0.4072           0.4677       0.4650

Table 3: Performance of the trained models

4 Discussion

In this report, we have described the approach we adopted during the challenge and the different methods we implemented. We managed to improve the performance of our predictive model and achieve slightly better results than CFM’s benchmark by adopting the same strategy of a hybrid predictive model while changing the preprocessing steps as well as the learning algorithm. In particular, we have found that neural networks are able to achieve relatively good results with a suitable architecture, e.g. the architectures proposed in Figures 4 and 5. However, we must point out that our final result does not represent the optimal solution achievable with the pipeline we have described. In fact, we did not study the impact of every possible hyperparameter, since this is a tedious task given the large number of possible combinations. This is why we believe that our final results can be further improved by changing certain parameters such as the stock embedding dimension, the size of the convolutional and max-pooling operations, the activation function, etc.

Additionally, we believe that further improvements can be achieved by focusing on the preprocessing step and trying to extract even more relevant features from the raw data. One particularly interesting method found in the literature can be applied to the return and volume columns by viewing them as time series and applying Topological Data Analysis tools, as in [6], to extract significant topological features that would help with the prediction. In fact, we experimented with this method using the open-source Python package gudhi [1]. We adopted the approach described there to extract the L1 norm of the first persistence landscape [6] of the point cloud formed by the return and volume time series. We have found that this topological feature correlates relatively well with the target (auction volume): we computed it for only 50 rows of the training set (50 different stocks on the same day) and found a correlation value of 0.426 between the L1 norm and the auction volume. However, this method is computationally heavy, especially for a training set as large as ours. Mainly for this reason, we were unable to test the relevance of this method further, but we think that it is somewhat promising.
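For completeness, a minimal sketch of how such a feature could be computed with gudhi follows; the way the two 61-sample series are paired into a 2-D point cloud, the maximal edge length, and the landscape resolution are illustrative assumptions rather than the exact construction of [6].

import numpy as np
import gudhi
from gudhi.representations import Landscape

# abs_ret and rel_vol are assumed to be the 61-sample return and volume series
# of a single row; here they are paired into a 2-D point cloud.
points = np.stack([abs_ret, rel_vol], axis=1)            # shape (61, 2)

rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
st.persistence()                                          # compute persistence pairs
diag = st.persistence_intervals_in_dimension(1)           # 1-dimensional homology

# First persistence landscape, summarized by its L1 norm and used as a scalar feature.
landscape = Landscape(num_landscapes=1, resolution=100).fit_transform([diag])[0]
l1_norm = np.sum(np.abs(landscape))
print(l1_norm)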
References

[1] GUDHI library for TDA: tutorials. https://github.com/GUDHI/TDA-tutorial.
[2] PyWavelets: continuous wavelet transform. https://pywavelets.readthedocs.io/en/latest/ref/cwt.html.
[3] Leo Breiman. Random forests. Machine Learning, Springer, 2001.
[4] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer, 2001.
[5] Daniel Libman, Simi Haber, and Mary Schaps. Volume prediction with neural networks. 2019.
[6] Marian Gidea and Yuri Katz. Topological data analysis for financial time series: Landscapes of crashes. arXiv:1703.04385, 2017.
[7] Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. https://doi.org/10.1098/rsta.2015.0202, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385 [cs.CV], 2015.
[9] Stéphane Mallat. A Wavelet Tour of Signal Processing, 3rd edition. https://www.di.ens.fr/~mallat/College/WaveletTourChap4.2.pdf.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167v3 [cs.LG], 2015.
[11] Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud. A comprehensive analysis of deep regression. arXiv:1803.08450v3 [cs.CV], 2018.