A THESIS PRESENTED FOR THE DEGREE OF MASTER OF SCIENCE IN
COMPUTATIONAL SCIENCE
UNCERTAINTY ANALYSIS OF PREDICTIONS
BY RECOMMENDER SYSTEMS BASED ON
MATRIX FACTORIZATION MODELS
Author:
SARDANA NAZAROVA
11106514
Supervisors:
DR. FABIAN JANSEN
DR. DRONA KANDHAI
DR. VALERIA KRZHIZHANOVSKAYA
Committee members:
DR. DRONA KANDHAI
DR. FABIAN JANSEN
DR. MICHAEL LEES
DR. VALERIA KRZHIZHANOVSKAYA
IOANNIS ANAGNOSTOU (MSC.)
SEPTEMBER 2016
Abstract
In this work a recommender system for financial markets products based on Factorization Machines
is considered. In practice, Factorization Machines are used to give predictions as point numbers without
any uncertainties around them. When giving recommendations, however, it is crucial that some quan-
titative indication of the accuracy of the recommendations is given, so that those who use it can assess
their reliability. While recommender systems mostly provide only the overall accuracy using a validation
dataset, our approach is to give accuracies on all of the individual predicted preferences. In this thesis,
an algorithm that estimates the accuracy of individual predictions made by recommender systems based
on matrix factorization models is created and tested. The method is furthermore applied to emerging-market bonds.
Acknowledgements
I would like to express my deep gratitude to my supervisors, Drona Kandhai, Fabian Jansen, and
Valeria Krzhizhanovskaya for the patient guidance, the valuable comments, remarks and help. Special
thanks to my daily supervisor Fabian. His guidance, inspiration and enthusiastic encouragement made
this thesis work possible. I am extremely grateful to Fabian for introducing me to the world of Data
science and for giving me the opportunity to work with an awesome team of data scientists. I would like
to thank the Wholesale Banking Advanced Analytics team at ING Bank for taking me as an intern and
treating me as an equal.
I would like to express my very great appreciation to Alexander Boukhanovsky, Michael Lees and
Valeria Krzhizhanovskaya for their continuous help throughout the two years at ITMO and UvA.
I owe my sincere thanks to Elena Nikolaeva and Mendsaikhan Ochirsukh for the friendship, kindness
and sense of humour.
Last but not least, I wish to thank my parents Anna Nazarova and Dmitriy Nazarov for their love and
unconditional support.
Sardana Nazarova, September 16, 2016
Table of contents

Introduction
1 Recommender systems
1.1 Approaches
1.2 Explicit and implicit feedback
1.3 Confidence in the recommendation
2 Matrix factorization models
2.1 Basic matrix factorization model
2.2 Factorization Machines
2.3 Optimization algorithms
2.3.1 SGD
2.3.2 ALS
2.3.3 MCMC
2.4 Software
3 Uncertainties estimation
3.1 Uncertainties for matrix factorization
3.1.1 Precision matrix
3.1.2 Covariance matrix
3.1.3 Individual prediction uncertainties
3.1.4 Estimation of normalization factor
3.1.5 The algorithm
3.2 Uncertainties for Factorization Machines
3.3 Bootstrapping
4 Uncertainty analysis using synthetic data
4.1 Test methodology
4.2 Test of various rating matrices and noise levels
4.3 Test of various sparsities
4.4 Estimation of a normalization factor
4.4.1 Determining d
4.5 Conclusions
5 Calculations on real data
5.1 Financial market data
5.2 Rating matrix
5.3 Implicit feedback
5.4 Factorization
5.4.1 Preference to buy or sell
5.4.2 Preference to trade
5.5 Conclusions
6 Conclusions and future work
Bibliography
A Minimization methods test
B Uncertainties for attribute-aware model
Introduction
The Wholesale Banking Advanced Analytics team in ING has introduced a recommender system for
trading financial markets products like commodities, bonds and foreign exchange products. The current
recommender system is based on Factorization Machines, in particular the software package libFM [1].
However, Factorization Machines give predictions as point numbers without any uncertainties around
them. Hence, ING has no real handle on the accuracy of the recommendations. A usual way of estimating
accuracy is splitting data into train and test sets. However, this method is not suitable in our case. The
reason is that real data on financial markets is highly sparse, so that splitting makes the data even sparser.
Recommender systems generally give recommendations in order of some estimated quantity, a “pref-
erence” or a “like”. When giving recommendations, it is crucial that some quantitative indication of the
accuracy of the recommendations is given, so that those who use it can assess their reliability. While
recommender systems mostly give only overall accuracy using a validation dataset, our approach is to
give accuracies on all of the individual predicted preferences. Without such an indication, it is difficult
to compare predictions, either among themselves or with other given values. It is essential for decision
makers to know how much they can rely on a prediction: the predictions 5 ± 4 and 5 ± 0.2 look identical as point estimates but carry very different information.
It is possible that models other than Factorization Machines are more suitable for financial markets
data. However, the focus of this thesis is on matrix factorization models, since they are used within ING;
due to the limited time available, only basic matrix factorization models and Factorization Machines are considered.
There are two optimization methods used to train factorization models: stochastic gradient descent
(SGD) and alternating least squares (ALS) [2]. LibFM also has Markov Chain Monte Carlo as an option.
Since matrix factorization is a non-convex problem, these methods could converge to a local minimum.
These methods and their stability are tested and compared.
Overall, the objective of the research is the creation and testing of an algorithm that estimates
the accuracy of individual predictions made by recommender systems based on matrix factorization
models. The algorithm additionally includes tools for estimating the overall noise level in the input data.
Problem Statement. The main research question is to formulate a measure of uncertainty for matrix
factorization based recommender systems and to apply a Factorization Machine with an uncertainty estimate
to financial markets products of ING.
The research questions are:
• Formulate a measure of uncertainty for matrix factorization based recommender systems.
• Test this measure on synthetic and real data.
Thesis structure. The thesis is structured as follows. The first chapter gives a brief description of the
different approaches to recommender systems. The next chapter describes matrix factorization
models, namely basic matrix factorization, the attribute-aware model, and, in more detail,
Factorization Machines; it also specifies the optimization methods.
Chapter 3 provides the algorithm for estimating the uncertainty on individual recommendations for the
matrix factorization model and for the more general Factorization Machine. There we also explain a
way to estimate the noise level in the data. The results of testing the algorithm on simulated data are given
in the next chapter. Chapter 5 shows the result of applying the method on financial markets data. The
conclusions and recommendations for future work are presented in the last chapter.
Chapter 1. Recommender systems
With the increasing amount of available data, users face the problem of finding the data that matters to them. For example, it can take a user a long time to find products in the huge catalogue of an internet shop. Recommender
systems thus become essential filters, able to make personalized recommendations of potentially useful items to a user ([3], [4], [5], [2]).
Automatically recommending items to users has been of academic interest for over 20 years, since
recommender systems became an independent research area in the early 1990s. There are multiple competitions
and conferences on recommender systems. The leading competition is the KDD Cup, a data-mining
competition [6]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD). Recommender systems are not only of academic interest but
also essential for industry. The most prominent example is Netflix, which had a huge impact on the field
with its competition to improve its recommender technology, the Netflix Prize. The Netflix
Prize competition stimulated a great deal of high-quality research; the KDD Cup'07, for example, focused on
predicting aspects of movie rating behaviour, employing the Netflix dataset [6].
The recommendation problem. A recommender system’s core function is to identify useful items for
the user. In order to do this a recommender system must be able to predict the utility of items, and then
decide which items are worth recommending [3]. “Preferences” or “likes” are predicted using data of
users, items, the past history of user preferences and other data. Put simply, the recommendation
problem is estimating unknown "ratings", the preferences or likes expressed by users for
items. General procedures for estimating ratings are described in [3]. A rating is represented as a (user,
item, rating) triple (c_i, s_j, r_ij). A user is defined by a profile c_i that includes user attributes, such as
gender, age, etc.; it can also be just a unique identifier. The same holds for an item s_j. The general
rating estimation procedure can then be defined as

\hat{r}_{ij} = u_{ij}(R, c, s),

where R = \{r_{ij} \neq \emptyset\} is the set of known ratings, c are the user profiles, s the item profiles, and u_{ij} the utility function.
1.1 Approaches
Depending on the utility function, there are three approaches to recommender systems: content-based,
collaborative filtering, and hybrid [3]. The content-based approach recommends items to a user that
are similar to the ones the user has rated before. By contrast, collaborative filtering relies only on
similarity in the past behaviour of users. Hybrid recommenders combine the collaborative and
content-based approaches.
Content-based systems calculate the similarity between an item and items that the user previously
rated, and recommend the best-matching ones. For that purpose the system uses item profiles. User
preferences are built on the basis of items previously rated by the user. Once user and item profiles are
available, similarities can be calculated using, for example, the cosine similarity measure.
u_{ij} = \cos(c_i, s_j) = \frac{c_i \cdot s_j}{\|c_i\|_2 \, \|s_j\|_2}
As an example, in keyword analysis s_j can be weights of keywords of a document j and c_i the
importance of these words to user i; the user then gets recommendations of documents from his
field of interest. Other machine learning techniques, such as decision trees and artificial neural networks,
are used as well; see [4] for a thorough discussion of a wide variety of content-based topics and methods.
However, content-based strategies suffer from limited content analysis (they require external information
that may be hard to collect or unavailable, and two different items with the same set of features are
indistinguishable) and from over-specialization (recommendations tend to be limited to items similar to
those already rated).
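As an illustration of the cosine utility above (a minimal sketch, not code from the thesis; the profile values are made up):

```python
import numpy as np

def cosine_similarity(c_i, s_j):
    """Utility u_ij = cos(c_i, s_j): the angle between a user profile and an item profile."""
    return float(np.dot(c_i, s_j) / (np.linalg.norm(c_i) * np.linalg.norm(s_j)))

# Hypothetical keyword weights: item profile s_j of a document and
# user profile c_i with the importance of the same keywords to user i.
s_j = np.array([0.8, 0.1, 0.5])
c_i = np.array([0.9, 0.0, 0.4])
print(cosine_similarity(c_i, s_j))
```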
Collaborative filtering bases its predictions on ratings by other users. There are two areas of collabo-
rative filtering algorithms: memory-based and model-based [5].
Memory-based (nearest neighbor) methods rely on relations between items or between users. User-
oriented ones measure a user rating for an item based on all the ratings of “neighboring” users for the
item. Finding “neighboring” (like-minded) users is done by evaluating the similarity (correlation, co-
sine, their modifications, or other) between two users on their ratings of items that both users have rated
[3]. The same methods are used for calculating item similarities in item-oriented methods. As described
in [4] memory-based methods suffer from some disadvantages. The main drawback of user-oriented
methods is that they would not perform well in case of high sparsity of ratings R, where there are few
coratings, since the similarity measure needs sets of items rated by both users. Another disadvantage
is expensive time- and memory-consuming calculations of user/item neighborhoods, as it requires com-
parison against all other users/items. There are several techniques, such as clustering and subsampling,
used to reduce time or memory consumption in user-oriented methods. For item-oriented methods the
consumption is reduced by storing only the top n correlations for each item and by calculating correlations
only for item pairs with more than k coratings, though this reduces the accuracy of the predictions. Expensive
similarity calculations can also be done offline, as in the Amazon recommender [7]. The problem of large
sparse ratings data with few coratings may be solved with dimensionality reduction algorithms, mapping
the underlying data to a latent space of smaller dimensionality [4]. The most well-known dimension-
ality reduction algorithm is matrix factorization, which belongs to model-based methods. Model-based
methods use the collection of ratings to learn a model from data using statistical and machine learning
techniques and then apply this model to get predictions.
1.2 Explicit and implicit feedback
Recommender systems use different types of data. Those are categorized into explicit feedback and
implicit feedback. The first uses explicitly given user preferences showing how much a user likes or
dislikes an item on some rating scale. For example, the Netflix Prize data are ratings on a scale from 1
(“totally dislike”) to 5 (“very like”) [6]. Explicit feedback is preferable, since it represents user
preferences fully. However, it is not always possible to collect explicit feedback. When this is the case,
user preferences may be deduced by recommender systems indirectly from user actions, for instance
purchase history.
Unfortunately, implicit feedback has characteristics that may prevent the direct use of algorithms
designed for explicit feedback [8]. The numerical value of implicit feedback describes the
frequency of actions and thereby indicates confidence. Recurring events are more likely to reflect user opinion;
thus by observing a user's behaviour one may infer which items the user probably likes, yet it remains
a guess whether the user actually likes them. The main problem is that it is mostly impossible to reliably infer
which items a user did not like. The fact that a user has no interaction with an item usually means the user
was not aware of that item, hence implicit feedback does not represent negative preferences.
Preferences r_ui of user u towards item i may be obtained from implicit feedback by assuming that
higher observed values mean stronger preference. For example, r_ui can indicate the number of times user
u purchased item i. There are more sophisticated ways to obtain preferences. For instance, Hu, Koren and
Volinsky [8] build preferences from user behaviour using confidence. They introduce two sets of variables
indicating preference and the confidence level in observing that preference. A variable p_ui
encodes the preference of user u for item i: if u consumed i, it is assumed that u likes i (p_ui = 1), otherwise there
is no preference (p_ui = 0):

p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases}

These preferences come with different confidence levels: usually a higher (more frequent) r_ui is a stronger
indication that the user likes the item. As a measure of confidence in observing p_ui two forms are
proposed: c_{ui} = 1 + \alpha r_{ui} and c_{ui} = 1 + \alpha \log(1 + r_{ui}/\varepsilon). The cost function of the minimization procedure
then becomes

\min \sum_{u,i} c_{ui} \, (p_{ui} - \hat{p}_{ui})^2.
Financial markets data do not provide explicit feedback, so ratings must be derived from implicit
feedback.
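The Hu-Koren-Volinsky construction above can be sketched as follows (a toy illustration; `alpha` and the count matrix are made-up values, and `weighted_cost` is a hypothetical helper, not part of [8]):

```python
import numpy as np

alpha = 40.0  # confidence scaling; a value the authors mention, not a tuned choice

# Hypothetical implicit-feedback counts r_ui (users x items); 0 = no interaction.
R = np.array([[3., 0., 1.],
              [0., 5., 0.]])

P = (R > 0).astype(float)   # binary preference p_ui
C = 1.0 + alpha * R         # confidence c_ui = 1 + alpha * r_ui

def weighted_cost(P, C, P_hat):
    """Confidence-weighted squared error: sum_ui c_ui (p_ui - p_hat_ui)^2."""
    return float(np.sum(C * (P - P_hat) ** 2))

print(weighted_cost(P, C, P))  # 0.0: a perfect reconstruction costs nothing
```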
1.3 Confidence in the recommendation
It was shown in [9] that providing an explanation with a recommendation may influence users to make the
right decision. Explanations provide a mechanism for handling the errors that come with a recommendation.
One such explanation is the confidence in the recommendation, the system's trust in its own
recommendations or predictions [10]. Presenting confidence in predictions can provide valuable
information to users in making their decisions.
Depending on the algorithm used in a recommender and on the task, the confidence in a prediction may
be obtained in different ways [11]. In general, collaborative filtering algorithms become more confident in
predictions for a user when more ratings by that user are known, and likewise for items. Hence, a confidence
measure can simply be associated with the amount of information about a user and an item in the dataset.
It can be interpreted as the strength of a recommendation: strong when the system is confident that the
item is appropriate, weak when it is unsure.
The most common measurement of confidence is the confidence interval. The probability distribu-
tion of ratings for an item is obtained differently in memory-based and model-based methods [12]. In
memory-based methods, the rating's distribution is obtained from the data; in nearest-neighbor
techniques, for example, it is obtained from the similarity of the k nearest users to
the user. For model-based methods it is possible to obtain the rating's distribution from the model itself
using uncertainty quantification techniques.
Matrix factorization models do not provide any information on the uncertainty and the confidence
of the recommendation [13]. Karatzoglou and Weimer introduced an algorithm that estimates the con-
ditional quantiles of the ratings in matrix factorization using quantile regression. It belongs to intrusive
uncertainty quantification methods. This thesis focuses on non-intrusive uncertainty quantification meth-
ods, since intrusive methods require reformulating the mathematical models.
Measuring the quality of confidence is difficult. In [14] two possible ways of evaluating confidence
bounds are given. If a recommender is trained with a confidence threshold α, for instance 95%, and
produces, along with the predicted rating, a confidence interval, the true confidence can be computed as

\alpha_{true} = \frac{n_+}{n_+ + n_-},

where n_+ and n_- are the number of times that predicted ratings fell within and outside
the confidence interval respectively. The true confidence should be close to the requested confidence α.
Another way is to filter out recommended items whose predicted-rating confidence is below
some threshold, and to estimate the prediction accuracy for different filtering thresholds.
Recommendations made with high confidence should, in general, be more accurate than those made with
lower confidence.
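The α_true evaluation can be sketched as follows (a minimal illustration with made-up predictions and uncertainties; `true_confidence` is a hypothetical helper, not code from [14]):

```python
import numpy as np

def true_confidence(r_true, r_pred, sigma, z=1.96):
    """Empirical coverage alpha_true = n_plus / (n_plus + n_minus) for the
    intervals r_pred +/- z*sigma (z = 1.96 for a requested 95% confidence)."""
    inside = np.abs(r_true - r_pred) <= z * sigma
    return float(inside.mean())

# Hypothetical predicted ratings with per-prediction uncertainties sigma
r_true = np.array([4.0, 2.5, 3.0, 5.0])
r_pred = np.array([4.1, 2.0, 3.5, 1.0])
sigma  = np.array([0.2, 0.3, 0.3, 0.5])
print(true_confidence(r_true, r_pred, sigma))  # 0.75: three of four intervals cover the truth
```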
Chapter 2. Matrix factorization models
Matrix factorization models describe ratings by characterizing users and items through latent factors.
The matrix factorization models considered in this work are a basic matrix factorization model, an attribute-aware
model and Factorization Machines. The basic model uses only ratings data and hence represents the model-based
collaborative-filtering approach; the attribute-aware model also incorporates content data and can
be considered a hybrid approach; the Factorization Machine is able to generalize most state-of-the-art
matrix factorization models.
2.1 Basic matrix factorization model
In recommender systems based on collaborative filtering, the input data can be represented as a matrix of
ratings R ∈ R^{n×m}:

R = \begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1m} \\
r_{21} & r_{22} & \cdots & r_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
r_{n1} & r_{n2} & \cdots & r_{nm}
\end{pmatrix},

where n is the number of users, m the number of items, and r_ij the rating given by user i to item
j. Usually R is a sparse matrix, since users are likely to rate only a small number of items
compared to the total number available. Matrix factorization models assume that the
matrix of ratings can be represented by a small number of latent factors, which allows factors to be
estimated even from highly sparse data. These models map both users and items
to a joint latent factor space of dimensionality d, such that user-item interactions are modeled as inner
products in that space [2]:

R_{n \times m} \approx A_{n \times d} \cdot B_{d \times m}   (2.1)
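For a fully observed matrix, the low-rank structure behind (2.1) can be illustrated with a truncated SVD (only an illustration: the models in this thesis learn A and B from the observed entries alone):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 6, 8, 2  # toy numbers of users, items, latent dimensions

# Build a rating matrix of exact rank d, then recover a factorization R = A . B.
R = rng.normal(size=(n, d)) @ rng.normal(size=(d, m))

U, s, Vt = np.linalg.svd(R, full_matrices=False)
A = U[:, :d] * s[:d]   # n x d "user" factors
B = Vt[:d, :]          # d x m "item" factors
print(np.allclose(R, A @ B))  # True: a rank-d matrix is reproduced exactly
```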
Each user i is associated with a vector a_i ∈ R^d, which may be considered as an indication of how much
the user prefers each of the d latent factors, and each item j with a vector b_j ∈ R^d representing the
latent factors of the item (see an example in figure 2.1).

Figure 2.1: Matrix factorization example

Hence, each user-item interaction, including the unknown ones, can be approximated as:

\hat{r}_{ij} = a_i \cdot b_j   (2.2)
As some users tend to give higher or lower ratings than average, and some items receive on average
higher or lower ratings, biases should also be included in the model. The biased matrix factorization model
is

\hat{r}_{ij} = w_i + w_j + a_i \cdot b_j   (2.3)

where w_i denotes the bias for the user and w_j for the item. However, it was shown in [15] that explicit
bias terms are not necessary, because they can be incorporated directly into the matrix factorization. The
proposed BRISMF model (Biased Regularized Incremental Simultaneous Matrix Factorization)
incorporates the bias terms by fixing the (d+1)-th column of A and the (d+2)-th row of B to the constant value 1.
Then (2.3) becomes

\hat{r}_{ij} = w_i + w_j + \sum_{f=1}^{d} a_i^f b_j^f = \sum_{f=1}^{d+2} a_i^f b_j^f,

where a_i^{d+1} = 1, b_j^{d+1} = w_j, b_j^{d+2} = 1, a_i^{d+2} = w_i.
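The BRISMF construction can be verified numerically; this toy sketch (not code from [15]) checks that the augmented factor product reproduces the biased model (2.3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 5, 3  # toy numbers of users, items, latent dimensions

A = rng.normal(size=(n, d))   # user latent factors
B = rng.normal(size=(d, m))   # item latent factors
w_user = rng.normal(size=n)   # user biases w_i
w_item = rng.normal(size=m)   # item biases w_j

# BRISMF: absorb the biases into two extra latent dimensions,
# a_i^{d+1} = 1, b_j^{d+1} = w_j and a_i^{d+2} = w_i, b_j^{d+2} = 1.
A_aug = np.hstack([A, np.ones((n, 1)), w_user[:, None]])
B_aug = np.vstack([B, w_item[None, :], np.ones((1, m))])

R_biased = w_user[:, None] + w_item[None, :] + A @ B   # eq. (2.3)
print(np.allclose(R_biased, A_aug @ B_aug))  # True: the biases sit inside the product
```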
Generalization to an attribute-aware model. Treating every user as a combination of user attributes,
a_i^f = \sum_{s=1}^{n} u_i^s a_s^f, and every item as a combination of item attributes, b_j^f = \sum_{q=1}^{m} p_j^q b_q^f,
generalizes the model (2.2) into an attribute-aware model [16]. The latent-factors part (without
biases or unary interactions) of the attribute-aware model is

\hat{r}_{ij} = \sum_{f=1}^{d} \sum_{s=1}^{n} u_i^s a_s^f \cdot \sum_{q=1}^{m} p_j^q b_q^f   (2.4)

where u_i ∈ R^{1×n} and p_j ∈ R^{1×m} are the attribute vectors of user i and item j, a ∈ R^{n×1} and
b ∈ R^{m×1} are latent factors, and d is the number of latent dimensions. In matrix notation this is

\hat{R} = (UA) \cdot (PB)^T   (2.5)
The model (2.5) reduces to (2.2) if U ∈ R^{n×n} and P ∈ R^{m×m} are identity matrices, with sizes n and m
equal to the number of users and items respectively:

UA = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
a_1^1 & a_1^2 & \cdots & a_1^d \\
a_2^1 & a_2^2 & \cdots & a_2^d \\
\vdots & \vdots & \ddots & \vdots \\
a_n^1 & a_n^2 & \cdots & a_n^d
\end{pmatrix}
= A

In this representation, for user i, being user i is itself the attribute.
Generalization to a Factorization Machine. The difference between an attribute-aware model and the
more general Factorization Machine is that the latter contains additional interactions between user
attributes (e.g., between age and salary) and between item attributes [1]:

\hat{r}_{ij} = \sum_{f} \Biggl(
\underbrace{\sum_{s=1}^{n} u_i^s a_s^f \cdot \sum_{q=1}^{m} p_j^q b_q^f}_{\text{attribute-aware model}}
+ \underbrace{\sum_{s=1}^{n} \sum_{s'=s+1}^{n} u_i^s a_s^f \, u_i^{s'} a_{s'}^f}_{\text{user attribute interactions}}
+ \underbrace{\sum_{q=1}^{m} \sum_{q'=q+1}^{m} p_j^q b_q^f \, p_j^{q'} b_{q'}^f}_{\text{item attribute interactions}}
\Biggr)   (2.6)
If user and item attributes are combined into one vector x_{ij} = (u_i, p_j), the above equation becomes

\hat{r}_{ij} = \hat{y}(x_{ij}) = \sum_{s=1}^{k} \sum_{s'=s+1}^{k} x_s x_{s'} \sum_{f=1}^{d} v_s^f v_{s'}^f   (2.7)

where k = n + m and v = (a, b).
2.2 Factorization Machines
The Factorization Machine is a general approach able to mimic most factorization models using feature
engineering [1]. It models interactions between all input variables using factorized interaction parame-
ters.
Figure 2.2: Example (from [1]) for representing a recommender problem with a design matrix X and a
vector of targets y. Every row represents a feature vector xi with its corresponding target yi
The input data for the Factorization Machine is described by a design matrix X ∈ Rn×k and a vector
of targets y ∈ Rn (figure 2.2). Each row xi ∈ Rk of X is a vector of real-valued features, describing
one case with k variables, and yi is the prediction target of that case. The variables in X may be binary
indicators, user and item attributes, time indicators or any real-valued features. The model is
\hat{y}(x) = w_0 + \sum_{s=1}^{k} w_s x_s + \sum_{s=1}^{k} \sum_{s'=s+1}^{k} x_s x_{s'} \sum_{f=1}^{d} v_s^f v_{s'}^f,

where d is the dimensionality of the factorization and the model parameters are w_0 ∈ R, w ∈ R^k, V ∈
R^{k×d}. The first part of the model is an overall bias plus the unary interactions of each input variable with
the target (as in a linear regression model). The second part contains all pairwise interactions of input
variables with factorized weights, assumed to have low rank. The Factorization Machine model is
highly adjustable: one can choose whether to include the bias and/or the unary interactions, and the number of
dimensions of the factorization. Further adjustment is possible by specifying the design
matrix X differently. For example, a basic matrix factorization model \hat{R} = A \cdot B^T is obtained by excluding the
bias and unary interactions and constructing x ∈ R^{|A|+|B|} with binary indicator variables: for every known
r_ij the feature vector has x_i = 1 and x_{|A|+j} = 1, with all other entries 0. The model is then

\hat{r}_{ij} = \hat{y}(x_{ij}) = \sum_{f=1}^{d} v_i^f v_j^f

With bias and unary interactions this model becomes a biased matrix factorization model (see eq. 2.3):

\hat{r}_{ij} = \hat{y}(x_{ij}) = w_0 + w_i + w_j + \sum_{f=1}^{d} v_i^f v_j^f
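As a sketch (not the libFM implementation), the Factorization Machine prediction for a single feature vector can be computed with the well-known O(k·d) reformulation of the pairwise sum:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine prediction y_hat(x) for one feature vector x.
    Uses the O(k*d) identity:
      sum_{s<s'} x_s x_s' <v_s, v_s'>
        = 0.5 * sum_f [ (sum_s v_sf x_s)^2 - sum_s v_sf^2 x_s^2 ]."""
    linear = w0 + w @ x
    Vx = V.T @ x                                        # shape (d,)
    pairwise = 0.5 * np.sum(Vx ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

# Toy check against the naive double sum over pairs s < s'.
rng = np.random.default_rng(1)
k, d = 6, 3
x, w0, w, V = rng.normal(size=k), 0.3, rng.normal(size=k), rng.normal(size=(k, d))
naive = w0 + w @ x + sum(x[s] * x[t] * (V[s] @ V[t])
                         for s in range(k) for t in range(s + 1, k))
print(np.isclose(fm_predict(x, w0, w, V), naive))  # True
```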
Though the bias terms may improve accuracy, they will not be included in the models used below for testing and
for the calculation of uncertainties, as they can be incorporated in the factorization part itself.
2.3 Optimization algorithms
Due to the high sparsity of the input data it is challenging to learn the model parameters. The state-of-the-art
models are learned only on the observed cases. Fitting only a few known cases is prone to
overfitting, but this may be avoided by regularization (see the effect of regularization in table A.3).

The model is learned by minimizing the cost function, the regularized sum of losses l over the
observed data S:

L(S) = \arg\min_{\Theta} \sum_{(x,y) \in S} l(\hat{y}(x|\Theta), y) + \sum_{\theta \in \Theta} \lambda_\theta \theta^2,   (2.8)

where Θ are the model parameters and \hat{y}(x|\Theta) is the prediction for the chosen Θ. Here L2 regularization
is applied. A loss function represents the price paid for inaccurate predictions. In Factorization
Machines a least-squares loss function (L2 norm) is chosen for a regression task:

l(\hat{y}(x|\Theta), y) = (\hat{y}(x|\Theta) - y)^2,

and for a binary classification task with y ∈ {−1, 1} a logistic loss:

l(\hat{y}(x|\Theta), y) = \ln\bigl(1 + \exp(-\hat{y}(x|\Theta)\, y)\bigr).

The learning algorithms used by Factorization Machines are based on stochastic gradient descent (SGD),
alternating least squares (ALS), and Bayesian inference using Markov Chain Monte Carlo (MCMC).
2.3.1 SGD
The stochastic gradient descent algorithm is popular for optimizing factorization models; it is simple
and computationally light. The algorithm iterates over the known cases (x, y) ∈ S and updates each model
parameter θ ∈ Θ in the direction opposite to the gradient of the loss:

\theta \leftarrow \theta - \eta \left( \frac{\partial l(\hat{y}(x|\Theta), y)}{\partial \theta} + 2 \lambda_\theta \theta \right),   (2.9)

where η is the learning rate and λ_θ are the regularization values. The prediction quality depends strongly on
the regularization values (figure A.3). Those are usually found by time-consuming grid searches that
require learning the model parameters multiple times. To overcome this issue, Rendle [17] introduced
SGD with adaptive regularization (SGDA). This method adapts the regularization values automatically
while training the model. SGDA alternates two steps: the first updates the model parameters
on a train set (2.9); the second updates the regularization values on a validation set:

\lambda \leftarrow \lambda - \alpha \frac{\partial l(\hat{y}(x|\lambda), y)}{\partial \lambda}.

The main shortcoming of SGD and SGDA compared with more complex algorithms is their strong dependence
on the learning rate η. Though it is proven that gradient descent converges with infinitesimal steps for
low-rank matrix approximations of rank higher than one, if η is too small the algorithm converges
slowly, and if η is too high it may not find any minimum (figures A.1, A.2).
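The update (2.9) for the basic model \hat{r}_{ij} = a_i \cdot b_j can be sketched as follows (a toy illustration, not libFM's implementation; sizes, rates and data are made up):

```python
import numpy as np

def sgd_epoch(S, A, B, eta=0.05, lam=0.01):
    """One SGD epoch over observed ratings S = [(i, j, r), ...] for the basic
    model r_hat_ij = a_i . b_j with squared loss; update per eq. (2.9):
    theta <- theta - eta * (dl/dtheta + 2*lam*theta)."""
    for i, j, r in S:
        err = A[i] @ B[:, j] - r
        grad_a = 2.0 * err * B[:, j]   # dl/da_i (fresh array, unaffected by the update below)
        grad_b = 2.0 * err * A[i]      # dl/db_j
        A[i]    -= eta * (grad_a + 2.0 * lam * A[i])
        B[:, j] -= eta * (grad_b + 2.0 * lam * B[:, j])

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(2, 2))
B = rng.normal(scale=0.1, size=(2, 2))
S = [(0, 0, 1.0), (1, 1, 2.0)]
loss = lambda: sum((A[i] @ B[:, j] - r) ** 2 for i, j, r in S)
before = loss()
for _ in range(200):
    sgd_epoch(S, A, B)
print(loss() < before)  # True: the training loss decreases
```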
2.3.2 ALS
It is shown in [18] that alternating least squares works well for low-rank matrix reconstruction. ALS
iterates over all parameters, minimizing the loss per model parameter until convergence. The optimal
value of one parameter can be calculated directly if all remaining parameters are fixed, since eq.
(2.8) then becomes a linear least-squares problem in that parameter. While in general an ALS iteration has a higher
runtime complexity, the implementation in Factorization Machines computes it with the same complexity
as an SGD iteration thanks to caching [19]. An advantage of ALS over SGD is that ALS does not need a
learning rate.
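As a sketch of the ALS idea (libFM's actual implementation updates one scalar parameter at a time with caching; this hypothetical helper instead updates a whole user vector at once, which solves the same regularized least-squares subproblem in closed form):

```python
import numpy as np

def als_update_users(R, mask, B, lam=0.1):
    """ALS half-step for r_hat_ij = a_i . b_j: with all item factors B fixed,
    each user vector a_i is the closed-form solution of a regularized
    least-squares problem over that user's observed ratings."""
    d = B.shape[0]
    A = np.zeros((R.shape[0], d))
    for i in range(R.shape[0]):
        cols = np.flatnonzero(mask[i])   # items rated by user i
        Bi = B[:, cols]                  # d x (number of observed items)
        A[i] = np.linalg.solve(Bi @ Bi.T + lam * np.eye(d), Bi @ R[i, cols])
    return A
```

Alternating this with the symmetric update of the item factors until convergence yields the ALS fit.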
2.3.3 MCMC
Both the ALS and SGD algorithms learn optimal parameters Θ, which are used to obtain a point estimate
of \hat{y}(x|\Theta). Markov Chain Monte Carlo instead generates the distribution of \hat{y} by sampling. ALS can be
considered a simplified MCMC without sampling: while MCMC samples parameters from the posterior
distribution θ ∼ N(µ*_θ, (σ*_θ)²), ALS uses the expected value θ = µ*_θ. MCMC inference is simpler
to apply than the other methods, since it has fewer parameters to adjust: it automatically determines the
regularization parameters Θ_H by placing a prior distribution on them. The algorithm iteratively draws
samples for the hyperparameters Θ_H and the model parameters Θ. The advantage of MCMC over SGD and
ALS is that MCMC takes the uncertainty in the model parameters into account. The disadvantage is that
MCMC has to store all generated models, which are used to calculate predictions; for large-scale models
like the Factorization Machine, saving the whole chain of models is very storage intensive [20]. libFM's
implementation of MCMC does not store the set of models, but sums up the predictions at every step and
returns their average at the end. This is why, to obtain reliable predictions, MCMC needs many more
iterations than ALS.
Tests of the listed methods are shown in appendix A. SGD depends strongly on the learning
step η: if η is too small, the algorithm converges slowly; if it is too high, it may not find any minimum.
For low sparsity, a high value of the regularization parameter λ prevents SGD from finding the optimum and
makes ALS more stable. For high sparsity, using regularization increases accuracy. SGDA performs as
well as SGD with a properly chosen regularization parameter value. MCMC performs significantly better
than the other examined methods, both for dense and sparse matrices.
2.4 Software
There are several implementations of Factorization Machines. The most used one is Steffen Rendle’s
library libFM [1]. libFM is written in C++ and there is a Python wrapper for libFM, pywFM, which pro-
vides a full functionality of libFM. There are other Python implementations of Factorization Machines:
pyFM and fastFM. It is possible that fastFM works faster than libFM, since the performance critical
code in fastFM is written in C and wrapped with Cython, however there is no literature on pyFM and
fastFM performance and reliability. Factorization machines are computationally expensive to scale to
large amounts of data and large numbers of features. These problems are solved with distributed imple-
mentations of Factorization Machines – DiFacto [21]. When using multiple workers DiFacto converges
significantly faster than libFM. In this work libFM is used, since it is most reliable and has a Python
wrapper.
Chapter 3. Uncertainties estimation
The factorization ˆR = A·Bᵀ (eq. 2.1) gives an estimate of all ratings, but these estimates can be quite inaccurate due to uncertainties. For
example, an estimate can be rij = 4 ± 0.1 or rij = 4 ± 3; the former is much more reliable. Not only
should an estimate be provided, but also the uncertainty on that estimate.
The first section of this chapter focuses on an analytical algorithm to estimate uncertainties on indi-
vidual predictions for a basic matrix factorization model described in section 2.1, which is a particular
case of Factorization Machines. Then the equations are generalized to Factorization Machines (section
3.2). In the analytical algorithm, which is based on the maximum likelihood estimator for the parameter variance,
first the uncertainties on the parameters are determined; then the uncertainties
on individual predictions are calculated using a linear approximation.
Another method of estimating uncertainties is the bootstrapping method (section 3.3). In the boot-
strapping method the uncertainties can be estimated directly, without computing parameter variances and
without linearization.
3.1 Uncertainties for matrix factorization
The maximum likelihood method estimates unknown parameters by the values that maximize
the probability of obtaining the observed sample [22]. The likelihood is the value of a probability density
function evaluated at the measured value of the observables:
L(Θ) = f(r ≡ rknown|Θ), (3.1)
where Θ is a set of parameters.
The joint density function for independent and identically distributed samples is f(r ≡ rknown|Θ) =
∏ij f(rij|Θ), therefore L(Θ) = ∏ij f(rij|Θ),∀rij ∈ rknown. For convenience, it is better to work with
log-likelihood: lnL(Θ) = ∑ij ln f(rij|Θ).
The function L can be maximized by setting the first partial derivative with respect to each θ ∈ Θ to zero and solving the
resulting equations:

(∂/∂θ) lnL(Θ)|_{θ=θ̂} = 0
From the central limit theorem it can be assumed that the error on rij is Gaussian and, therefore, each data
point is described by a Gaussian probability function. Using a Gaussian probability density function,
lnL(Θ) can be written as:

lnL(Θ) = −(1/2) ∑_ij ((rij − r̂ij(Θ))/σ)² = −(1/2) χ²  (3.2)

χ² = ∑_ij ((rij − r̂ij(Θ))/σ)² = RSS_min/σ²,  (3.3)

where RSS_min = ∑_ij (rij − r̂ij(Θ̂))², ∀rij ∈ r_known.
In this case maximizing L(Θ) is equivalent to minimizing the chi-squared value χ². Assuming that
the log-likelihood function is a parabola around the minimum, the parameter variances can be calculated
in closed form. The error on a parameter is defined as the change of the parameter which
produces a change of the χ² value equal to 1 [23] (figure 3.1). Then, with error propagation, the
uncertainties on individual recommendations are calculated (section 3.1.3).

Figure 3.1: Parabolic error
The maximum likelihood estimator for the parameter variance is obtained from the inverse of the
second derivative of −lnL(θ) at the minimum, i.e. at θ = θ̂:

E⁻¹ = (−∂² lnL/∂θ²)⁻¹ = 2 (∂²χ²/∂θ²)⁻¹  (3.4)
3.1.1 Precision matrix
A second-order Taylor expansion of χ² around the minimum is:

χ²(Θ) ≈ χ²(Θ̂) + (1/2) ∑_i ∑_j (∂²χ²/∂θi∂θj)|_Θ̂ (θi − θ̂i)(θj − θ̂j)

In matrix notation,

χ²(Θ) ≈ χ²(Θ̂) + (Θ − Θ̂)ᵀ · E · (Θ − Θ̂),

with Eij = (1/2) (∂²χ²/∂θi∂θj)|_Θ̂.
E is the precision (inverse covariance) matrix. The non-zero elements of the precision matrix are:
∂²χ²/(2∂al²) = −(1/σ²) ∂(∑_{j=1}^m (rlj − al bj) bj)/∂al = (1/σ²) ∑_{j=1}^m bj²,  ∀rlj ∈ r_known

∂²χ²/(2∂bk²) = −(1/σ²) ∂(∑_{i=1}^n (rik − ai bk) ai)/∂bk = (1/σ²) ∑_{i=1}^n ai²,  ∀rik ∈ r_known

∂²χ²/(2∂al∂bk) = −(1/σ²) ∂(∑_{j=1}^m (rlj − al bj) bj)/∂bk = (1/σ²)(2al bk − rlk) ≈ (1/σ²) al bk
In the more general case, when there are d factors, the non-zero elements of E are:

E_{al^f, al^g} = ∂²χ²/(2∂al^f ∂al^g) = −(1/σ²) ∂(∑_{j=1}^m (rlj − ∑_{p=1}^d al^p bj^p) bj^f)/∂al^g = (1/σ²) ∑_{j=1}^m bj^f bj^g  (3.5)

E_{bk^f, bk^g} = ∂²χ²/(2∂bk^f ∂bk^g) = (1/σ²) ∑_{i=1}^n ai^f ai^g  (3.6)

E_{al^f, bk^g} = ∂²χ²/(2∂al^f ∂bk^g) ≈ (1/σ²) al^g bk^f,  (3.7)

where, as before, the sums run over the known ratings.
Here the indices of E denote the elements of E corresponding to their variables. σ is an unknown normaliza-
tion factor (section 3.1.4); the parameter variances will be proportional to this factor [23].
The precision (inverse covariance) matrix E ∈ ℝ^{(n+m)d × (n+m)d} is constructed using (3.5), (3.6) and
(3.7). The precision matrix is the symmetric positive semi-definite matrix of second-order partial
derivatives of the multivariate function −lnL at the solution point. It is a measure of the uncertainty of
the least-squares problem through its relationship to the inverse error covariance.
3.1.2 Covariance matrix
The covariance matrix is the inverse of the previous section's precision matrix:

⟨∆θi ∆θj⟩ = [E⁻¹]_ij,

where ∆θ are the deviations of the parameters θ ∈ Θ. The parameter variances are the diagonal elements of
the covariance matrix, and the uncertainties are the square roots of the variances:

∆θi = √([E⁻¹]_ii)  (3.8)
Unfortunately, in this case the precision matrix is singular, because the parameters are not independent.
In general, for the matrix factorization model, the rank of the precision matrix E ∈ ℝ^{(n+m)d × (n+m)d} is equal to

(n+m)·d − d²,  (3.9)

where d is the number of latent factor dimensions; there are d² non-independent parameters. This can
be proved as follows.
Let R ∈ ℝ^{n×m} be a rating matrix, factorized into the users' and items' latent factor matrices A ∈ ℝ^{n×d}
and B ∈ ℝ^{m×d} (eq. 2.1), so R = A·Bᵀ. Then for an arbitrary nonsingular matrix C ∈ ℝ^{d×d}: R = A·Bᵀ =
ACC⁻¹Bᵀ = (AC)(C⁻¹Bᵀ). Since C has d² elements, there are d² parameters which can be manipulated
without changing the product A·Bᵀ. The most trivial case showing non-independent variables is that in
R = A·Bᵀ dividing A by any non-zero real number k and multiplying Bᵀ by the same number k does not
change the product: (A/k)·(kBᵀ) = A·Bᵀ (see the example in figure 3.2).
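The d² gauge freedom can be verified numerically; the matrix C below is an arbitrary invertible d×d matrix chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 5, 7, 2
A = rng.normal(size=(n, d))
B = rng.normal(size=(m, d))
C = np.array([[2.0, 1.0], [0.5, 3.0]])   # any invertible d x d matrix

R1 = A @ B.T
# transform the factors: A -> A C, B^T -> C^{-1} B^T
R2 = (A @ C) @ (np.linalg.inv(C) @ B.T)

same = np.allclose(R1, R2)               # the product is unchanged
```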
Solution. To solve the problem, the dependent parameters should be fixed. The proposed method for
obtaining the pseudo-inverse E⁺ is to eliminate the zero eigenvalues in the eigendecomposition of the ma-
trix. As the eigendecomposition is a particular case of the SVD for a positive semi-definite normal matrix,
the pseudo-inverse may be calculated as the Moore–Penrose pseudo-inverse via the singular
value decomposition (SVD). If E = UΣVᵀ is the SVD of E, then the pseudo-inverse of E is
E⁺ = UΣ⁺Vᵀ, where Σ⁺ is the diagonal matrix of the reciprocals of the non-zero singular values of E,
with zeros for the rest.
Figure 3.2: Example of dependent parameters
Eigendecomposition of a matrix. If E is a symmetric (n×n) matrix with n linearly independent eigen-
vectors, then E can be factorized as
E = QΛQ−1
, (3.10)
where Q is the square matrix whose ith column is the eigenvector qi of E and Λ is the diagonal matrix
whose diagonal elements are the according eigenvalues, i.e., Λii = λi. The eigenvalues are non-negative
because E is positive semi-definite.
Then the inverse of E is given by

E⁻¹ = QΛ⁻¹Q⁻¹,

where Λ⁻¹_ii = λi⁻¹.
Zero eigenvalues truncation scheme. As there are d² zero eigenvalues, E is not invertible in the math-
ematical sense. To fix the dependent parameters, the zero eigenvalues must be truncated by setting their
inverses to zero:

Λ⁻¹_ii = 1/λi if λi ≠ 0, and Λ⁻¹_ii = 0 if λi = 0

This method was used in [24] to remove components carrying no information.
Interpretation. As assumed before, the variables r1, …, rn are normally distributed. They may be
correlated in a normally distributed way, so the joint probability is

P(r) ∼ exp(−(1/2) rᵀ·E·r),  (3.11)

where ri = ⟨ri⟩ + ∆ri and E is the inverse of the covariance matrix of the variables r, also known
as the error matrix. Usually in real cases the r are statistically uncorrelated, so E is the diagonal matrix
E = diag(1/σ1², …, 1/σn²) and P(r) ∼ exp(−(1/2) ∑_{i=1}^n ri²/σi²),
where σi² is the inverse of the diagonal element Eii.
After substituting the eigendecomposition (3.10) of the matrix, E = QΛQ⁻¹, into (3.11) we have P(r) ∼
exp(−(1/2) rᵀ·Q·Λ·Q⁻¹·r).
Defining y = Q⁻¹·r ⟺ r = Q·y gives P(y) ∼ exp(−(1/2) yᵀ·Λ·y).
Since Λ is the diagonal matrix of eigenvalues, the y's are independent: P(y) ∼ exp(−(1/2) ∑_{i=1}^n λi yi²), with λi = Λii
the eigenvalues. If the eigenvalues λ are sorted decreasingly and k ≤ n is the number of non-zero eigenvalues,
then P(y) ∼ exp(−(1/2) ∑_{i=1}^k λi yi²). If one of the eigenvalues λi is zero, then the corresponding yi can be any real
number and may be fixed at zero. The rest of the y's have average ⟨yi⟩ = 0, variance ⟨(yi − ⟨yi⟩)²⟩ = 1/λi and
covariance ⟨(yi − ⟨yi⟩)(yj − ⟨yj⟩)⟩ = 0, since the y's are independent. The r's are rm = ∑_{i∈I} Qmi yi, a sum
over the y's with corresponding non-zero eigenvalues, because the ones with zero eigenvalues were fixed at
zero.
Note: in programming languages, functions calculating eigenvalues usually return small floating-point
numbers instead of exact zeros for zero eigenvalues, which means comparing eigenvalues to 0 will not work properly.
It is known that k = rank(E) variables are independent, which means only k eigenvalues are non-zero. In
practice, the eigenvalues should be sorted decreasingly and the first k eigenvalues and their eigenvectors
taken; the remaining values should be treated as 0.
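A sketch of the truncation scheme, using the rank-based cutoff described above instead of comparing eigenvalues to zero (the matrix E here is a small synthetic rank-deficient PSD matrix, not a real precision matrix):

```python
import numpy as np

def truncated_inverse(E, k):
    """Pseudo-inverse of a symmetric PSD matrix E keeping only the k
    largest eigenvalues; the remaining (zero) eigenvalues are truncated."""
    eigval, eigvec = np.linalg.eigh(E)      # returned in ascending order
    order = np.argsort(eigval)[::-1]        # sort decreasingly
    eigval, eigvec = eigval[order], eigvec[:, order]
    inv_val = np.zeros_like(eigval)
    inv_val[:k] = 1.0 / eigval[:k]          # invert only the first k
    return eigvec @ np.diag(inv_val) @ eigvec.T

# a singular PSD matrix of rank 2
M = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
E = M @ M.T                                 # 3 x 3, rank 2
E_plus = truncated_inverse(E, k=2)
ok = np.allclose(E_plus, np.linalg.pinv(E))
```

For a symmetric positive semi-definite matrix this coincides with the Moore–Penrose pseudo-inverse.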
3.1.3 Individual prediction uncertainties
The uncertainty on an individual prediction is the standard deviation of the estimator r̂ij of R,
found using (2.2). As the expectations and variances of the parameters ai, bj are known, the variance
of r̂ij can be derived from the function (2.2).
To find the variance of r̂ij, a first-order Taylor expansion of (2.2) is written:

r̂ij = ai·bj = ∑_{f=1}^d ai^f bj^f = ∑_{f=1}^d µi^f νj^f + ∑_{f=1}^d ∆ai^f νj^f + ∑_{f=1}^d µi^f ∆bj^f + ∑_{f=1}^d ∆ai^f ∆bj^f,

where µi^f, νj^f are the expectations of the variables ai^f, bj^f respectively, and ∆ai^f, ∆bj^f are their deviations.
From this equation the variance of r̂ij can be derived:

∆rij² = ⟨(∑_{f=1}^d ai^f bj^f − ∑_{f=1}^d µi^f νj^f)²⟩ = ⟨(∑_{f=1}^d (∆ai^f νj^f + µi^f ∆bj^f + ∆ai^f ∆bj^f))²⟩

∆rij² = ∑_{f,g} νj^f νj^g ⟨∆ai^f ∆ai^g⟩ + ∑_{f,g} µi^f µi^g ⟨∆bj^f ∆bj^g⟩ + 2∑_{f,g} µi^f νj^g ⟨∆ai^g ∆bj^f⟩ + O(∆³)  (3.12)

Here the O(∆³) terms are neglected, since they are considered very small. The uncertainties on individual recommen-
dations are the standard deviations ∆rij = √(∆rij²).
3.1.4 Estimation of normalization factor
The overall normalization factor σ can be calculated using a goodness-of-fit measure. If we assume
a normally distributed population with standard deviation σ, then the residual sum of squares (RSS),
divided by σ², has a chi-squared distribution with N degrees of freedom:

χ² = ∑_ij (rij − f(xij))²/σij²,  RSS = ∑_ij (rij − f(xij))²

If f(x) describes the data, then the expected value of the chi-squared is ⟨χ²⟩ = Ndof. If the function has
been fitted to the data, then the fact that the parameters have been adjusted to describe the data has to be
accounted for. This makes Ndof, the number of degrees of freedom, equal to the difference
between the total number of data points and the number of independent parameters:

Ndof = Ndata − Nparams  (3.13)

If the data uncertainties σ are not known, then the χ²_min of the fit provides an estimate for them. A rough
estimate of σ is obtained by setting

χ²/Ndof = 1,  (3.14)

from which it follows that

σ² ≈ RSS/Ndof  (3.15)

Ndof must be a positive number: if the number of independent parameters is higher than the total number
of data points, the model overfits and there is no point in calculating uncertainties at all. This leads to the
requirement for the estimation of the measurement uncertainty:

Ndata > Nparams  (3.16)
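Equations (3.13)–(3.15) can be checked with a toy fit; a polynomial model stands in for the factorization here, and all values are illustrative:

```python
import numpy as np

# toy check of eq. (3.15): fit a quadratic to noisy data and estimate sigma
rng = np.random.default_rng(4)
sigma_true = 0.2
x = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, sigma_true, x.size)

coeffs = np.polyfit(x, y, deg=2)           # 3 fitted parameters
residuals = y - np.polyval(coeffs, x)
rss = np.sum(residuals**2)
n_dof = x.size - 3                          # N_data - N_params, eq. (3.13)
sigma_hat = np.sqrt(rss / n_dof)            # eq. (3.15)
```

With 197 degrees of freedom the estimate is within a few percent of the true noise level.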
3.1.5 The algorithm
The analytical algorithm for calculating uncertainties on individual recommendations works under the as-
sumption that the recommendations are normally distributed. Here, the uncertainty means a one-standard-
deviation error, so rij ± ∆rij implies a confidence level of ≈ 68%.
1. Factorize matrix of ratings
2. Construct precision matrix E
3. Find covariance matrix V = E−1
(a) Find eigenvalues and eigenvectors of E
(b) Calculate rank(E)
(c) Calculate inverse of E using Eigenvalue truncation scheme
4. Calculate uncertainties on individual recommendations
5. Estimate normalization factor and normalize
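Steps 2–4 of the algorithm can be sketched as follows for the basic matrix factorization model. For brevity, step 1 is skipped and the true factors stand in for the fitted Θ̂; the Moore–Penrose pseudo-inverse plays the role of the eigenvalue truncation scheme, and the sizes, sparsity and σ are illustrative:

```python
import numpy as np

def prediction_uncertainties(A, B, known, sigma):
    """Build the precision matrix E from eqs. (3.5)-(3.7), pseudo-invert it,
    and propagate the parameter covariances to every r_ij via eq. (3.12)."""
    n, d = A.shape
    m = B.shape[0]
    E = np.zeros(((n + m) * d, (n + m) * d))
    for i in range(n):                       # diagonal a-blocks, eq. (3.5)
        Bi = B[known[i]]
        E[i*d:(i+1)*d, i*d:(i+1)*d] = Bi.T @ Bi / sigma**2
    for j in range(m):                       # diagonal b-blocks, eq. (3.6)
        Aj = A[known[:, j]]
        r0 = n*d + j*d
        E[r0:r0+d, r0:r0+d] = Aj.T @ Aj / sigma**2
    for i in range(n):                       # cross blocks, eq. (3.7)
        for j in np.flatnonzero(known[i]):
            r0 = n*d + j*d
            blk = np.outer(B[j], A[i]) / sigma**2   # blk[f,g] = a_i^g b_j^f
            E[i*d:(i+1)*d, r0:r0+d] = blk
            E[r0:r0+d, i*d:(i+1)*d] = blk.T
    # Moore-Penrose inverse; rcond truncates the d^2 zero eigenvalues
    V = np.linalg.pinv(E, rcond=1e-10)
    dr = np.zeros((n, m))
    for i in range(n):                       # error propagation, eq. (3.12)
        Va = V[i*d:(i+1)*d, i*d:(i+1)*d]
        for j in range(m):
            r0 = n*d + j*d
            Vb = V[r0:r0+d, r0:r0+d]
            Vab = V[i*d:(i+1)*d, r0:r0+d]    # Vab[f,g] = cov(a_i^f, b_j^g)
            var = B[j] @ Va @ B[j] + A[i] @ Vb @ A[i] + 2.0 * B[j] @ Vab @ A[i]
            dr[i, j] = np.sqrt(max(var, 0.0))
    return dr

# demo: true factors in place of the fitted ones, 50% sparsity
rng = np.random.default_rng(2)
A0 = rng.uniform(-1, 1, (30, 2))
B0 = rng.uniform(-1, 1, (40, 2))
known = rng.random((30, 40)) < 0.5
dr = prediction_uncertainties(A0, B0, known, sigma=0.2)
```

As the thesis finds (figure 4.6), the propagated uncertainties come out larger for unknown entries than for known ones.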
3.2 Uncertainties for Factorization Machines
Precision matrix. For the Factorization Machine model the precision matrix is calculated in the same way
as for the matrix factorization model; the only difference is that the function f(x) used for the derivatives is given by
equation (2.7). For convenience of derivation, equation (2.7) may be written as:

r(x) = (1/2) ∑_{f=1}^d [ (∑_{j=1}^k vj^f xj)² − ∑_{j=1}^k (vj^f)² xj² ]

∂r(x)/∂vs^f = xs (∑_{j=1}^k xj vj^f − xs vs^f)

Its precision matrix is

E_{vs^f, vq^g} = ∑_i (∂r/∂vs^f)(∂r/∂vq^g) = ∑_i xi^s xi^q (∑_{l=1}^k xi^l vl^f − xi^s vs^f)(∑_{l=1}^k xi^l vl^g − xi^q vq^g),

where the xi are the feature vectors of the known data points.
Covariance matrix. For factor dimension d = 1, E is an invertible matrix, since all parameters are
independent, and the covariance matrix can be found directly as V = E⁻¹. For higher factor dimensions
there are non-independent parameters and the precision matrix is singular; the zero eigenvalues trunca-
tion scheme described in section 3.1.2 can be used to fix the dependent parameters.
Individual recommendation uncertainties. As the expectations and variances of the parameters v are known,
the uncertainties on individual predictions ri can be calculated using equation (2.7).
First, the first-order Taylor approximation is written:

T(r(x;v)) = r(x;µ) + ∑_{f=1}^d ∑_{l=1}^k (vl^f − µl^f) (∂r/∂vl^f)|_{v=µ} + O(∆²),  (3.17)

where O(∆²) is the remainder term representing all higher-order terms, which is considered small.
The approximation of the variance is

Var(r) ≈ Var(T(r)) + O(∆³)

As all components in equation (3.17) except the vl^f are constants, the variance of r is

Var(r) ≈ Var(∑_{f=1}^d ∑_{l=1}^k vl^f (∂r/∂vl^f)|_{v=µ})

It is the variance of a linear combination:

Var(∑_f ∑_{l=1}^k vl^f al^f) = ∑_{l,f} (al^f)² Var(vl^f) + ∑_{(l,f)≠(m,g)} al^f am^g Cov(vl^f, vm^g),  (3.18)

where al^f = (∂r/∂vl^f)|_{v=µ}.
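Equation (3.18) is the standard quadratic form wᵀCw of the coefficient vector against the parameter covariance matrix; a quick numeric check against direct sampling (the covariance matrix and coefficients below are synthetic illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
# a small positive-definite covariance matrix for the parameters v
L = rng.normal(size=(4, 4))
C = L @ L.T + 0.1 * np.eye(4)
w = rng.normal(size=4)                 # the coefficients a_l^f of eq. (3.18)

var_formula = w @ C @ w                # variance of the linear combination
# Monte Carlo check: sample v ~ N(0, C) and measure the variance of w.v
v = rng.multivariate_normal(np.zeros(4), C, size=200_000)
var_sampled = (v @ w).var()
```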
3.3 Bootstrapping
Bootstrapping is a special case of Monte Carlo simulation used for the specific purpose of obtaining
an estimate of the sampling distribution by drawing many samples. In general, the Monte Carlo (MC) method
is a numerical method for solving mathematical problems by simulating random variables. It
is commonly used in cases where an analytical evaluation of errors is difficult or impossible. The idea of
the method is to build a model of the measurement and run it a large number of times, each time with
a different random number seed. The width of the distribution of the measured values is then taken as the
estimate of the measurement uncertainty.
Some properties of MC:
• The MC estimate converges to the true value due to the law of large numbers.
• The MC estimate is asymptotically normally distributed due to the central limit theorem.
As we want to know the uncertainties on the predicted ratings r̂ij, R is first factorized: rij ≈ r̂ij =
∑_{f=1}^d a⁰_if b⁰_jf, where d is the number of latent factors. R̂ is an estimate of the population, so an estimate of
the sampling distribution can be obtained by drawing many samples.
To find the uncertainty on each measurement r̂ij, the following model of the measurement is used:
1. Create a new matrix R′: r′ij = r̂ij + εij, where εij is a sample drawn from a normal distribution with
mean 0 and deviation equal to the normalization factor σ (for the estimation of σ see section
3.1.4).
2. Factorize R′ with d factors: r̂′ij = ∑_{f=1}^d a′_if b′_jf.
3. Keep all approximations r̂′ij.
After repeating the steps above N times, there is a sample of N approximations [r̂′ij] for every
element rij (see an example of the sample distribution in figure 3.3). If N is high enough, the standard
deviation of the sample [r̂′ij] represents the true deviation ∆rij due to the law of large numbers. In principle,
asymmetric uncertainties can be obtained with bootstrapping, although here the uncertainty is assumed
to be symmetric.
Figure 3.3: Example of ˆrij distribution from Bootstrapping. ˆrij is a predicted rating and ∆rij is the
standard deviation of the obtained sample [ˆrij ]
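The three steps can be sketched as follows. Because the resampled matrix R′ is dense, its rank-d factorization in step 2 is computed here with a truncated SVD rather than with libFM; the matrix R̂, σ and the number of simulations are illustrative assumptions:

```python
import numpy as np

def rank_d_factorize(R, d):
    """Best rank-d approximation of a dense matrix (truncated SVD)."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]

def bootstrap_uncertainties(R_hat, d, sigma, n_sim=50, seed=0):
    """Steps 1-3 of the bootstrapping model of the measurement, repeated
    n_sim times; the per-entry std of the refits estimates delta r_ij."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n_sim,) + R_hat.shape)
    for s in range(n_sim):
        R_new = R_hat + rng.normal(0.0, sigma, size=R_hat.shape)  # step 1
        samples[s] = rank_d_factorize(R_new, d)                   # steps 2-3
    return samples.std(axis=0)

# illustrative rank-2 estimate R_hat and noise level
rng = np.random.default_rng(3)
R_hat = rng.uniform(-1, 1, (30, 2)) @ rng.uniform(-1, 1, (40, 2)).T
dr = bootstrap_uncertainties(R_hat, d=2, sigma=0.2, n_sim=50)
```

Because the rank-d refit removes most of the injected noise, the per-entry standard deviations come out well below σ.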
Chapter 4. Uncertainty analysis using synthetic data
The analytical algorithm described in section 3.1 and the bootstrapping method described in section 3.3 are
compared in different scenarios: various matrix sizes, latent factor dimensions, noise levels and spar-
sities. Because the analytical algorithm is an approximation based on linearization and Gaussian noise,
the error calculations are tested with simulated data.
4.1 Test methodology
Data simulation. First, a factorizable matrix of ratings R⁰ = A·Bᵀ is initialized, where A ∈ ℝ^{n×d}
and B ∈ ℝ^{m×d} are the matrices of latent factors for users and items respectively. The elements of A and B are
randomly drawn from a uniform distribution. In real cases the ratings rij usually have noise on top, so

rij = r⁰ij + εij,  (4.1)

where r⁰ij ∈ R⁰ and εij is a sample drawn from a normal distribution with mean 0 and deviation
equal to a chosen σdata. R is randomly divided into known and unknown elements, which form the training
dataset R_known and the test dataset R_unknown respectively. Taking β% as the sparsity of R, (100 − β)% of the values of R
are considered known. All parameters for initializing the input rating matrix are given in table 4.1.
Other parameters used in the tests are given in table 4.2.
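The simulation procedure, with the parameter values of table 4.3, can be sketched as:

```python
import numpy as np

# data simulation per eq. (4.1), with the parameter values of table 4.3
rng = np.random.default_rng(5)
n, m, d = 30, 40, 2
ll, lr, sigma_data, beta = -1.0, 1.0, 0.2, 0.5

A = rng.uniform(ll, lr, (n, d))
B = rng.uniform(ll, lr, (m, d))
R0 = A @ B.T                                     # noiseless ratings
R = R0 + rng.normal(0.0, sigma_data, R0.shape)   # r_ij = r0_ij + eps_ij
known = rng.random(R.shape) >= beta              # (100 - beta)% entries known
R_known = np.where(known, R, np.nan)             # training set; NaN = unknown
```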
Table 4.1: Parameters for initializing input rating matrix
Parameter Description
n Number of users
m Number of items
d Number of latent factor dimensions
ll, lr Uniform distribution limits: ai^f, bj^f ∼ U(ll, lr)
σdata Standard deviation of added Gaussian noise: εij ∼ N(0, σdata)
β Rating matrix sparsity
Table 4.2: Other parameters
Parameter Description
Method Method used for model training
Iterations Number of training iterations
Initialization std.dev. Standard deviation of a zero-mean normal distribution for initializing factors
Simulations Number of simulations in the Bootstrapping
Unless otherwise stated, the following input rating matrix parameters are used below (table 4.3).
Distributions of the ratings r⁰ for various numbers of latent factor dimensions are shown in figure 4.1.
Table 4.3: Parameter values
Parameter Value
n 30
m 40
d 2
ll, lr -1, 1
σdata 0.2
β 50%
Method ALS
Iterations 2000
Initialization std.dev. 0.1
Simulations 50
Figure 4.1: Ratings r⁰ distribution for various latent factor dimensionalities d. With higher d the range of
values expands; for d = 2 the values are approximately in the range (−2, 2)
Factorization. R_known is the input matrix for the algorithm. The factorization of R_known uses the
same number of factors d as in the initialization of R. In the tests below it is assumed that d is known; in reality,
however, d has to be found. Factorization of R gives predictions r̂ij = ∑_{f=1}^d âi^f b̂j^f (figure 4.2).
Since the approximations r̂ij try to eliminate the noise N(0, σdata) in rij, the distribution of
(rij − r̂ij)/σdata, rij ∈ R_known, should be a standard normal (figure 4.3). This is a check that the optimization procedure found the
minimum: if the standard deviation of the distribution SD[(rij − r̂ij)/σdata] ≫ 1, the model training
failed and there is no point in finding prediction uncertainties. Conversely, if the standard deviation
SD[(rij − r̂ij)/σdata] ≪ 1, the model is overfitted.
Analytical algorithm. If the model trained well, the uncertainties are calculated using the analytical algo-
rithm. The pull distribution of the normalized residuals

(r⁰ij − r̂ij)/∆rij,

where r⁰ij = ∑_{f=1}^d ai^f bj^f are the values in R_unknown before adding the noise εij in equation (4.1) and ∆rij are the
uncertainties calculated by the analytical algorithm, is expected to be a standard normal as well, since the
approximations r̂ij try to reach r⁰ij.
Figure 4.2: Comparison of unknown ratings and their predictions. The cloud of predictions versus ratings
without added noise is narrower, which means r̂ tries to eliminate the noise
Figure 4.3: The pull distribution of the normalized residuals (rij − r̂ij)/σdata, rij ∈ R_known. The standard deviation is ≈ 1, which means the optimization procedure found the minimum

Figure 4.4: The pull distribution of the normalized residuals (r⁰ij − r̂ij)/∆rij, rij ∈ R_unknown. The standard deviation is slightly higher than 1, which means the uncertainties ∆rij are underestimated, although just slightly
It is important to note that in the test the overall normalization factor σ is known: it is the standard
deviation of the normal noise in equation (4.1). If the noise/fluctuations/measurement uncertainties
are underestimated, the standard deviation of the normalized residuals will be higher; if overestimated, it
will be smaller than with a correct estimate of the uncertainties. Although the pull distribution of the
normalized residuals is normally distributed, its standard deviation is slightly higher than 1, which means
the uncertainties on rij are underestimated, although only by a few percent (figure 4.4).
Bootstrapping. If the pull distribution of the normalized residuals is close to a standard normal, the uncertain-
ties are also estimated by the bootstrapping method. If the analytical algorithm is correct, a scatter plot of
the two results should be the straight line ∆r_CALC = ∆r_MC, where ∆r_CALC and ∆r_MC are the results of the
analytical algorithm and the bootstrapping respectively. A comparison of the uncertainties computed by
both methods is shown in figure 4.5. For low uncertainties, the calculation and the bootstrapping agree, which
means the analytical solution is directly proportional to the real uncertainties. Higher uncertain-
ties, however, are underestimated by the calculation. This can be explained by the linear approximation used in the
maximum likelihood method and in equation (3.12).
Figure 4.5: Comparison of calculated ∆rij (rij ∈ Runknown) with the Bootstrapping results. The analytical
algorithm underestimates higher uncertainties
As expected, the uncertainties for unknown ratings are higher than those for known ratings, because
the model parameters are learned on the known ratings (figure 4.6).
Figure 4.6: Histogram of the calculated ∆r. The uncertainties for unknown
ratings (test set) are higher than for known ratings (training set)
The results shown above are for a single simulated rating matrix and a particular choice of parameters. More
tests with various parameters are shown below.
4.2 Test of various rating matrices and noise levels
Comparison of the analytical algorithm and the bootstrapping results for various matrix sizes and noise
levels (figure 4.7) shows that the analytical result converges better to the bootstrapping result
for larger rating matrices. This can be explained by the fact that for larger matrices with the same sparsity
there are on average more known values per user and per item.
The analytical algorithm's results are more underestimated for higher noise levels, although the algo-
rithm can estimate uncertainties even for relatively high noise (see the ratings distribution for d = 2 in
figure 4.1).
For the same matrix size and sparsity but a higher number of latent factor dimensions, the
analytical algorithm's results become less reliable (figure 4.8). With too many factors for a highly
sparse rating matrix, the uncertainties are too high, so it is impossible to make reliable predictions.
Figure 4.7: Comparison of the analytical algorithm and the Bootstrapping results for unknown ratings for
various matrix sizes and noise levels σ. From left to right: increasing noise level. From top to bottom:
increasing matrix size. The analytical algorithm’s results are less underestimated for larger matrices
Figure 4.8: Comparison of the analytical algorithm and the Bootstrapping results for unknown ratings
for various numbers of latent factor dimensions d and sparsities β. From left to right: increasing d. From
top to bottom: increasing β. The analytical algorithm's results become less reliable for higher d and
β
4.3 Test of various sparsities
Whether the method works on sparse data is tested on the same input matrix. Figures 4.9 and 4.10 show the pull
distributions of the normalized residual deviations for various sparsities of the rating matrix R, for (r − r̂)/σdata
and (r⁰ − r̂)/∆r respectively. The figures show the results of 20 simulations. From the first figure it may be
concluded that for a very low number of known data points the model may overfit. The second figure
shows that for a high sparsity of the input, the method cannot estimate the uncertainties for known or unknown
ratings properly. With higher sparsity, the prediction uncertainties increase (figure 4.11). This is in
line with expectations: the less data is known, the higher the prediction uncertainties are. For
highly sparse rating matrices, the uncertainties on the rating predictions become larger than the predictions
themselves, which means the predictions are unreliable (figure 4.12).
Figure 4.9: Deviation of the normalized residuals (r − r̂)/σdata versus sparsity. For high sparsity the model fails to find the unknown ratings
Figure 4.10: Standard deviation of the normalized residuals (r⁰ − r̂)/∆r versus sparsity. For high sparsity of R the method cannot estimate the uncertainties for known or unknown ratings properly
Figure 4.11: Uncertainty distribution versus sparsity. With higher sparsity, prediction uncertainties
increase, because the less data is known, the higher the prediction uncertainties are. For
known ratings, however, the uncertainties increase only slightly
Figure 4.12: Distribution of the relative uncertainties ∆r/|r̂|. For highly sparse rating matrices, the uncertainties on rating predictions become larger than the predictions themselves, which means the predictions become unreliable
4.4 Estimation of a normalization factor
Parameter values used in the test are listed in table 4.4.
Table 4.4: Parameter values
Parameter Value
n 30
m 40
d 2
ll, lr -1, 1
σdata 0.2
β 0.5
Method ALS
Iterations 1000
Initialization std.dev. 0.1
If the uncertainty σdata is known, the goodness of fit given by equation (3.14) can be tested. To get
an estimate of ⟨χ²/Ndof⟩, a Monte Carlo simulation with 100 runs is executed:
Ndata = n·m·(1−β) = 600 known data points;
Nparams = (n+m)·d − d² = 136 independent parameters (eq. (3.9));
Ndof = Ndata − Nparams = 464.
Figure 4.13: Histogram of chi-square per degree of freedom. An estimate of chi-square per degree of
freedom is approximately 1
In the test, as expected, the average of the distribution is ⟨χ²⟩/Ndof ≈ 1 (figure 4.13). If the input noise
σdata is unknown, then instead of estimating χ²/Ndof in the MC simulation, the RSS is calculated for each
run, and σdata is estimated with equation (3.15). Below, averages of 10 estimates are shown.
In the test case there are 136 independent parameters, which means at least 137 of the 1200 possible ratings
must be known due to the requirement Ndata > Nparams (3.16); this corresponds to a maximum sparsity of 88.6%
for the rating matrix (figure 4.14).
Figure 4.14: Maximum sparsity of a rating matrix due to the requirement Ndata > Nparams
If the number of degrees of freedom is low (not enough data points), the uncertainty estimate is not
reliable. For high sparsity (a low number of degrees of freedom) the normalization factor estimate is much
higher than the input noise in the data (figure 4.15). Hence the uncertainties on the rating predictions are high
and the predictions are not reliable.
The level of sparsity at which σ̂data starts diverging also depends on the level of input noise in the data: the
higher the input noise and the sparsity, the larger the divergence of the estimate σ̂data from the actual
σdata. This level also differs for different sizes of the rating matrix. For example, the same test on
a larger rating matrix with the same number of factors (n = 60, m = 80, d = 2) shows that the method
gives a reliable noise level estimate even for a highly sparse rating matrix (figure 4.16). In contrast, the
test on a rating matrix initialized with a higher number of factors (n = 30, m = 40, d = 4)
shows that the model overfits and, hence, the noise level is underestimated (figure 4.17).
Figure 4.15: Normalization factor estimate for various sparsity (n = 30,m = 40,d = 2). For high sparsity
the estimate is much higher than the input noise in data
Above, it was assumed that the number of factors in the matrix factorization model, d_model, is known a
priori and equals the number of factors used to initialize the input rating matrix, d_data. In reality, however, d is
an unknown parameter and hence has to be found. A test with a wrongly chosen number of factors,
d_model ≠ d_data, shows that σ̂data > σdata for d_model < d_data. As expected, σ̂data < σdata for d_model > d_data
due to overfitting, since the RSS in the estimation of σdata (see eq. (3.15)) is calculated only on the known ratings.
Figure 4.16: Normalization factor estimate for various sparsity (n = 60,m = 80,d = 2). The level of
sparsity when the estimate starts diverging is higher for a larger rating matrix
Figure 4.17: Normalization factor estimate for various sparsity (n = 30,m = 40,d = 4). The model
overfits (too many factors for a small matrix) and, hence, the noise level is underestimated
Figure 4.18: Normalization factor estimate for the wrongly chosen number of factors. ˆσdata > σdata for
dmodel < ddata. ˆσdata < σdata for dmodel > ddata
4.4.1 Determining d
To determine d, cross-validation is used. Cross-validation is a method for estimating the prediction accuracy of a model
or for performing model selection [25]. The prediction error is estimated for models with various d, and
the value of d at which it is minimal is taken as the model's
number of factors d_model. The method holds out part of the available data as a validation set and takes the
rest as a training set. The model is fit to the training set, and the predictive accuracy is evaluated on the
validation set. The split-train-evaluate cycle is repeated multiple times, and the estimated accuracy is
obtained by averaging over the rounds.
Depending on how the data is split, there are different types of cross-validation. The most popular
is k-fold cross-validation, in which the dataset is partitioned into k mutually exclusive subsets of approxi-
mately equal size; at each round one subset is taken as the validation subset and the other k − 1 subsets
are taken as the training subset. For a highly sparse input matrix, however, k-fold cross-validation
may fail to give a reliable accuracy estimate, since the number of rounds is fixed and equal to k. Random
subsampling cross-validation randomly splits the input dataset into training and validation subsets, with
the same proportions at each round. Since a part of the input data is held out as the validation set, the
sparsity of the rating matrix increases to γ + (1 − γ)β, where γ is the proportion of the dataset held out.
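As an illustration of one random subsampling round described above, the sketch below draws a random train/validation split of the known rating positions and computes the resulting training-matrix sparsity γ + (1 − γ)β (the function names are ours, not from the thesis code):

```python
import numpy as np

def random_subsampling_split(known_idx, gamma, rng):
    """One round of random subsampling cross-validation: hold out a fraction
    gamma of the known rating positions as a validation set."""
    known_idx = np.asarray(known_idx)
    n_val = int(round(gamma * len(known_idx)))
    perm = rng.permutation(len(known_idx))
    return known_idx[perm[n_val:]], known_idx[perm[:n_val]]  # (train, validation)

def effective_sparsity(beta, gamma):
    """Sparsity of the training matrix after holding out a fraction gamma
    of the known entries of a rating matrix with sparsity beta."""
    return gamma + (1 - gamma) * beta
```

For β = 0.7 and γ = 0.2, for example, the training-matrix sparsity rises to 0.76, which is why the maximum admissible d shrinks under cross-validation.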
The test shows that with increasing sparsity and an increasing input number of factors the method more
often fails to find the right ddata, since higher sparsity means fewer data points and more factors mean more parameters (figure 4.19).
Figure 4.19: The results of 10 random k-fold cross-validations with k = 5. The middle graphs do not
include results for dmodel = 5 due to the violation of requirement (3.16). Solid lines show the accuracy
assessed for the validation set, dashed lines for the training set.
4.5 Conclusions
From the tests, the following conclusions about the analytical algorithm and the Bootstrapping
method may be drawn.
The Bootstrapping method:
• The Bootstrapping method is slow – it needs many computationally intensive runs to give accurate
uncertainty estimates.
• With a sufficient number of runs, the approximation may be considered close to the true distribution,
due to the Bootstrapping method's properties listed in section 3.3.
The analytical algorithm:
• The analytical method is fast – the only computationally intensive part is calculating the inverse of
the precision matrix. There is no need to run the matrix factorization multiple times as in the Bootstrap-
ping method.
• The analytical method requires the overall normalization factor σdata in equations (3.5, 3.6, 3.7).
One run of model training gives a good estimate of the normalization factor, provided the
training finds a global minimum.
• The analytical algorithm's results are less underestimated for larger matrices.
• The analytical algorithm's results become less reliable for lower Ndof and higher input noise.
• For highly sparse rating matrices, the uncertainties on rating predictions become larger than the
predictions themselves, which means the predictions are unreliable.
The analysis of the goodness-of-fit shows that the method is implemented without programming
errors and is stable even for high sparsity. Considering that in real recommendation problems the ratings
matrix is large and sparse (for instance, the sparsity of the well-known MovieLens 1M dataset is ≈ 95%)
and that the analytical algorithm works on simulated large sparse matrices, the algorithm may be useful
for estimating noise levels in real data.
Chapter 5. Calculations on real data
5.1 Financial market data
The input dataset consists of requests for quotes (RFQs) on emerging market bonds. A bond
is a debt investment in which an investor loans money to an entity (usually a corporation or government)
that borrows the funds for a set period of time at a variable or fixed interest rate. Bonds are issued by
companies, municipalities and states to raise money and finance a variety of projects. The specification of
the dataset used in the recommender system is provided in table 5.1.
Table 5.1: Input data columns
Column Description
Trade date Date of sending the request for quote
Customer The company or government who bought or sold the bond
Issuer Bond issuer
Bond The name of bond
Buy/Sell Type of trading (buy or sell)
Price The market price at trade date of a tradeable bond, represented by a percentage of the bond’s par value
Volume Number of bonds traded in this transaction
Time to maturity The date on which the bond will mature (reach par)
Other parameters of the bonds are not considered. For the sake of privacy protection, the data is anonymized.
Additionally, for convenience, all customers, bonds and issuers are assigned individual numerical identi-
fiers (natural numbers in ascending order). An overview of the data is given in table 5.2.
Table 5.2: Input data values
Column Values
Trade date from 29/04/2015 to 29/05/2016
Customer 1400 unique customers
Issuer 131 unique issuers
Bond 414 unique bonds
Buy/Sell 13880 buy RFQs, 13458 sell RFQs
Price [51.75 %, 172.5 %]
Volume [10³, 4.85·10⁷]
Time to maturity From 5 days to 30 years
The data contains 27338 transactions over one year starting 29/04/2015 (figure 5.1). The blank period
around trade day 250 without any trades corresponds to holidays: twelve days from December 23rd
till January 3rd inclusive.
Figure 5.1: Transactions per day
There are 1400 customers. The most active customers, both by total traded volume and number of
trades, are five global inter-dealer brokers. The normalized histogram of the number of transactions
per week for these companies shows that they traded less this year than last year
(figure 5.2). Customers who trade a lot buy and sell bonds almost equally, while for less active cus-
tomers the correlation between the numbers of buy and sell transactions is weaker (figure 5.3). 358 customers
only buy, 283 customers only sell, and the others both buy and sell.
Figure 5.2: Distribution of transactions per week. The five most active customers are inter-dealer
brokers. They traded less this year than last year
There are 131 issuers with a total of 414 bonds in the dataset. About half of the issuers have one bond and
some have more than 20 different bonds (figure 5.4). Not all issuers' bonds were traded in all quarters:
some issuers' bonds were not traded in 2016, some issuers' bonds have no transaction history in
2015 (figure 5.5). The time to maturity of traded bonds ranges from five days to 30 years (figure 5.6). Bonds
with a time to maturity within 10 years are the most popular. The histograms of prices and volumes of trades
are presented in figure 5.7.
Figure 5.3: Comparison of customers' buy and sell transactions. Customers who trade a lot buy and sell
almost equally, while for less active customers the correlation between the numbers of buy and sell
transactions is weaker
Figure 5.4: The cumulative histogram of issuers’ bonds. About half of issuers have one bond and some
have more than 20 different bonds
Figure 5.5: Time when issuers’ bonds were traded. Not all issuers’ bonds were traded in all quarters
Figure 5.6: Time to maturity of transactions. Bonds with time to maturity within 10 years are most
popular
Figure 5.7: Price and volume of trades
5.2 Rating matrix
In matrix factorization, the input data should be represented as a matrix of ratings (rui) ∈ Rn×m, where rui
is the rating given by customer u to item i. Items may be bonds or issuers. To have a rating matrix dense
enough for matrix factorization, temporal aggregation is applied with the following periods: one month,
one quarter (three months) and one year (table 5.3). If bonds are taken as items, the rating matrix is too
sparse; for this reason, bond issuers are taken as items.
Table 5.3: Size and density of rating matrix depending on period
Period (T) Time Customers Bonds Density Issuers Density
1 month February 2016 425 278 1.16% 81 3.22%
1 quarter October-December 2015 827 312 1.65% 88 4.05%
1 year May 2015 – April 2016 1400 414 1.98% 131 3.82%
The rating matrix for a period of one year, with issuers ordered by the number of customers who traded their
bonds, is shown in figure 5.8. Figures 5.9 and 5.10 show how many customers traded how many unique
bonds from how many issuers within the entire period. 445 customers traded only one bond each, and 957
customers traded 5 or fewer bonds each. 60 bonds were traded by only one customer, and 108
bonds were traded by 5 or fewer customers.
Figure 5.8: Rating matrix for a period of one year. The issuers are ordered ascending by the number of
customers who traded their bonds; customers are then sorted by the number of issuers
Figure 5.9: The cumulative histogram of unique
bonds customers traded
Figure 5.10: The cumulative histogram of unique
customers per bond
Since the rating matrix is highly sparse, densifying the matrix may increase the accuracy of predictions. The
straightforward approach to densifying a rating matrix is to remove unpopular users and/or items, i.e.
those with a low number of known ratings. To make the rating matrix denser, an algorithm that ensures
all users and items have at least l ∈ N entries in the rating matrix is applied (algorithm 1). On
each round the algorithm removes users and items that have fewer than l entries in the rating matrix.
As a result, with increasing l the numbers of customers and items in the rating matrix decrease drastically;
the rating matrix becomes smaller and denser (tables 5.4, 5.5 and 5.6).
Table 5.4: One month period rating matrix parameters for various entries limit l
l Customers Issuers Elements Density, %
1 425 81 1108 3.22
2 198 71 873 6.21
3 118 57 693 10.30
4 71 47 526 15.76
5 45 35 380 24.13
6 21 21 197 44.67
Algorithm 1 Removing users and items without enough known ratings
Data: rating matrix (rui) ∈ Rn×m {rui - rating of a user u to an item i},
minimum number of entries l ∈ N.
Result: reduced rating matrix.
1: repeat
2: U ←− set of users in R
3: I ←− set of items in R
4: for all user u ∈ U do
5: if number of known ratings R[u,:] < l then
6: R ←− remove corresponding row u in R
7: end if
8: end for
9: STOP = true
10: for all item i ∈ I do
11: if number of known ratings R[:,i] < l then
12: R ←− remove corresponding column i in R
13: STOP = false
14: end if
15: end for
16: until STOP
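Algorithm 1 can be sketched in Python as follows (a hypothetical NumPy implementation, assuming unknown ratings are encoded as NaN; not the thesis code):

```python
import numpy as np

def densify(R, l):
    """Algorithm 1: iteratively drop users (rows) and items (columns) of the
    rating matrix R that have fewer than l known (non-NaN) ratings."""
    R = np.asarray(R, dtype=float)
    users = np.arange(R.shape[0])
    items = np.arange(R.shape[1])
    while True:
        keep_u = (~np.isnan(R)).sum(axis=1) >= l   # users with enough ratings
        R, users = R[keep_u], users[keep_u]
        keep_i = (~np.isnan(R)).sum(axis=0) >= l   # items with enough ratings
        R, items = R[:, keep_i], items[keep_i]
        if keep_i.all():                           # STOP: no item removed this round
            break
    return R, users, items
```

Each pass first drops users and then items; the loop stops once an item pass removes nothing, mirroring the STOP flag in the pseudocode.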
Table 5.5: One quarter period rating matrix parameters for various entries limit l
l Customers Issuers Elements Density, %
1 827 88 2951 4.05
2 449 76 2565 7.52
3 321 70 2299 10.23
4 230 65 2012 13.46
5 165 64 1748 16.55
6 125 59 1526 20.69
Table 5.6: One year period rating matrix parameters for various entries limit l
l Customers Issuers Elements Density, %
1 1400 131 7014 3.82
2 910 95 6492 7.51
3 677 82 6007 10.82
4 523 80 5539 13.24
5 436 80 5191 14.88
6 362 75 4799 17.68
5.3 Implicit feedback
As the data do not represent explicit “preferences” of customers towards bonds, the “preferences” must be
derived from implicit feedback. Two ways of deriving “preferences” from implicit feedback are applied.
Preference to buy or sell. An item may be both sold and bought by a user. The total amount of the user's
trades represents whether the user “likes” the item or not. The “preference” is chosen as follows: if a
user buys an item more than they sell it, or never sells it, it is assumed that the user “likes” the item; and it is
the other way around for “disliking” the item. The following aggregation procedure is applied. First, the
amount of a trade is calculated as the product of the volume and the price (in percentage) of the trade, with
a sign depending on the trade's type:
at = Volume·Price if “buy”, −Volume·Price if “sell”.
Then for each pair of user u and item i the total traded amount within the chosen period is calculated:
rui = ∑t∈(u,i) at,
and the preference is
pui = 1 if rui > 0, −1 if rui < 0.
Cumulative amounts of buys and sells of the top 6 most traded pairs of customer and issuer are shown in figure
5.11. In some cases, there is a big difference between the amount of buys and the amount of sells, which may
suggest a strong preference for a particular type of trading (buy or sell). Yet some customers tend to buy
and sell the same issuer's bonds equally (green and dark blue lines).
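The aggregation above can be sketched as follows (an illustrative helper; the tuple layout and function name are our assumptions, not the thesis code):

```python
from collections import defaultdict

def buy_sell_preference(trades):
    """trades: iterable of (customer, issuer, side, volume, price) tuples,
    where side is "buy" or "sell" and price is a fraction of par.
    Returns {(customer, issuer): (r_ui, p_ui)} with r_ui the net signed
    traded amount and p_ui = sign(r_ui)."""
    r = defaultdict(float)
    for customer, issuer, side, volume, price in trades:
        sign = 1.0 if side == "buy" else -1.0      # a_t = ±volume·price
        r[(customer, issuer)] += sign * volume * price
    return {k: (v, (v > 0) - (v < 0)) for k, v in r.items()}
```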
Figure 5.11: Cumulative amounts of buys and sells of the top 5 most traded pairs of customer and issuer
Preference to trade an issuer's bonds. One item may be traded by a user multiple times, and this
frequency of transactions, regardless of the type of transaction, may be used as an indication of “pref-
erence”: more frequent trading is a stronger indication that the user likes the item. The “preference” may be
calculated as:
pui = log(1+nui),
where nui is the number of transactions of item i by user u within a period. The logarithm is taken because
some user-item pairs have a large number of transactions (see figures 5.12 and 5.13).
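A minimal sketch of this preference, assuming the input is simply the list of (customer, issuer) pairs of all transactions in the period:

```python
import math
from collections import Counter

def trade_preference(trades):
    """p_ui = log(1 + n_ui), where n_ui counts the transactions of each
    (customer, issuer) pair in `trades` (an iterable of such pairs)."""
    n = Counter(trades)
    return {ui: math.log1p(count) for ui, count in n.items()}
```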
Figure 5.12: Number of transactions per user-item pair within all dates
Figure 5.13: The binary logarithm of the number of transactions per user-item pair within all dates
5.4 Factorization
The main problem in matrix factorization is choosing the number of factors, d, for a given dataset. To
determine d, the k-fold cross-validation technique described in section 4.4.1 is used. For each matrix the
number of factors d should be chosen small enough not to violate requirement (3.16) that the number
of independent parameters must not exceed the number of known data points (section 3.1.2). As cross-
validation holds out a part of the dataset as a validation subset, S = ST ∪ SV, the sparsity of the training subset
ST is higher than the rating matrix sparsity β: βT = β + (1 − β)/k. Therefore, the maximum possible
d may be smaller. As the one month and one quarter rating matrices are small and sparse, 20-fold cross-
validation is used for them. The d at which the prediction error is minimal is taken
as the model's number of factors.
MCMC is chosen as the optimisation method, as it performs better than the other methods (appendix A).
The parameters used for the cross-validation are listed in table 5.7.
Table 5.7: Cross-validation check parameters
Parameter Value
d {1···6}
Method MCMC
Iterations 1000
Initialization std.dev. 0.1
Number of folds, k 10, 20
5.4.1 Preference to buy or sell
Prediction of the preference to buy or sell is a binary classification task. The prediction accuracy of the
binary classification is the proportion of correctly classified values:
ACC = (∑ True buy + ∑ True sell) / Total population
The classifier gives as predictions the probability estimates of the positive (buy) class. Normally the
threshold is 0.5. The results of performing 10-fold cross-validations to validate the classification with a
threshold of 0.5 are shown in figure 5.14. Limiting the number of entries by applying algorithm 1
to the matrices does not have any effect on the accuracy. Even though the accuracy on the one month
rating matrix is in some cases slightly higher than on the other matrices, the variance of the accuracy estimate is
high. The best accuracy with low variance is obtained on the one year period matrix and equals 57%.
Sensitivity (true positive rate) and specificity (true negative rate) values suggest that, by changing the
classification probability threshold, the accuracy might be improved. However, the distributions of the
predictions for the two classes highly overlap (figure 5.15). A receiver operating characteristic (ROC) curve, which
illustrates the performance of a binary classifier as its classification probability threshold is varied, shows
that adjusting the threshold does not improve the accuracy. The mean area under the curve
(AUC) for the validation sets is 0.59. It may be concluded that predicting users' buy and sell preferences
towards items with the matrix factorization model, under the assumption that the amount of a user's
trades represents the user's preference to buy or sell, is not reliable.
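The accuracy, sensitivity and specificity used here can be sketched as follows (an illustrative helper, assuming labels +1 for buy and −1 for sell):

```python
def classification_metrics(y_true, p_buy, threshold=0.5):
    """Accuracy, sensitivity (true buy rate) and specificity (true sell rate)
    for labels +1 = buy / -1 = sell and predicted buy probabilities."""
    y_pred = [1 if p >= threshold else -1 for p in p_buy]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return ((tp + tn) / len(y_true),
            tp / n_pos if n_pos else float("nan"),
            tn / n_neg if n_neg else float("nan"))
```

Scanning `threshold` over [0, 1] while recording sensitivity and 1 − specificity traces out the ROC curve discussed above.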
Figure 5.14: Accuracy score, sensitivity and specificity for matrices with ratings as preference to buy
or sell. The matrices' parameters are listed in tables 5.4, 5.5 and 5.6. Some of the results for a low limit of
entries l do not include results for high d due to the violation of requirement (3.16). Solid lines show
the accuracy assessed for validation sets, dotted lines for training sets. Accuracies are shown with a
one-standard-deviation interval. Higher accuracy is better
Figure 5.15: The distribution of the buy and sell classes' predictions in validation sets, created from
10-fold cross-validation on the T = 1 year, l = 1 matrix and d = 1 model. For other settings the results
are more or less the same
Figure 5.16: ROC of validation sets, created from 10-fold cross-validation on the T = 1 year, l = 1
matrix and d = 1 model. For other settings the results are more or less the same
5.4.2 Preference to trade
Prediction of the preference to trade is a regression task. To compare errors across the datasets and the
models, the prediction accuracy is chosen as the normalized MSE (NMSE) – the MSE divided by the mean
square of the targets:
NMSES = ∑rij∈S (rij − r̂ij)² / ∑rij∈S rij² , (5.1)
where S is a set of ratings.
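Equation (5.1) can be sketched in NumPy (an illustrative helper, not the thesis code):

```python
import numpy as np

def nmse(r, r_hat):
    """Normalized MSE (eq. 5.1): squared residuals summed over the rating
    set S, divided by the sum of squared target ratings."""
    r, r_hat = np.asarray(r, float), np.asarray(r_hat, float)
    return float(np.sum((r - r_hat) ** 2) / np.sum(r ** 2))
```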
For the non-“cleaned” one year period matrix (l = 1) and 10-fold cross-validation, the maximum is dmax = 4.
The cross-validation gives an optimal dopt = 4 (figure 5.17).
Figure 5.17: NMSE for the one year period rating matrix without a limit on entries (T = 1 year, l = 1).
Dashed lines show a one-standard-deviation interval
The results of performing 10- and 20-fold cross-validations on the “cleaned” matrices, whose parameters
are listed in tables 5.4, 5.5 and 5.6, are shown in figure 5.18. Limiting the number of entries by applying
algorithm 1 to the matrices has different effects. In the one month period rating matrix, it decreases the
accuracy. In the one quarter and one year period matrices, it does not have any considerable effect
on the accuracy. The benefit is that it makes it possible to use a higher number of latent factors
d. The lowest error is achieved for T = 1 month, l = 1, d = 1. For the one quarter period matrix, the
optimum is dopt = 3; for the one year period matrix, dopt = 6. Unfortunately, the comparison of ratings and
their predictions in the validation sets shows that the predictions are quite far from the real values. The results of
20-fold cross-validation of the d = 1 model on the one month period rating matrix without limiting
entries, 20-fold cross-validation of the d = 3 model on the one quarter period matrix and 10-fold
cross-validation of the d = 6 model on the one year period matrix with entries limited to l = 3 are
shown in figure 5.19. It may be concluded that it is impossible to make reliable predictions on the one
month period matrix.
To obtain the accuracy of individual predictions, uncertainties are calculated by the analytical algorithm
(section 3.1). The algorithm requires point estimates of the parameters. However, the optimization method
MCMC samples parameters from the posterior distribution and gives its final predictions as the mean of all
predictions made during the learning process. Therefore, the chosen models are optimized again by SGD with
learning rate η = 0.02 and regularization λ = 0.2. To obtain a test set, cross-validation is used. The pull
distribution of the normalized residuals (rij − r̂ij)/∆rij, where rij are the values
in the validation sets Runknown and ∆rij are the uncertainties calculated by the analytical algorithm, is
expected to be standard normal. Values with rij = 1 and r̂ij = 1 are dropped from the distribution, since
the minimum rating in the training set is ≥ 1 and libFM bounds its predictions:
min rij∈Rtrain (rij) ≤ r̂ ≤ max rij∈Rtrain (rij).
This means r̂ij = 1 does not indicate any preference and should be ignored. The pull distributions of the normal-
ized residuals show that the calculated errors are mostly correct. However, the distributions have a high
peak around zero, which means there are many overestimated errors. The distributions are skewed right,
because more rating predictions are underestimated than overestimated (figure 5.20). For T = 1 quarter,
the uncertainties on rating predictions are mostly lower than the predictions themselves. For T = 1 year,
there is a bigger portion of predictions with uncertainties higher than the predictions (figure 5.21).
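The pull computation, including the dropping rule for predictions at the lower bound, can be sketched as follows (an illustrative NumPy helper; r_min = 1 reflects the minimum training rating above):

```python
import numpy as np

def pull(r, r_hat, dr, r_min=1.0):
    """Normalized residuals (r - r_hat)/dr over a validation set, dropping
    entries where both the rating and its prediction equal the lower
    prediction bound r_min, since such pairs carry no preference signal."""
    r, r_hat, dr = (np.asarray(x, float) for x in (r, r_hat, dr))
    keep = ~((r == r_min) & (r_hat == r_min))
    return (r[keep] - r_hat[keep]) / dr[keep]
```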
5.5 Conclusions
Two ways of deriving preferences from implicit feedback are applied: preference to buy or sell and
preference to trade an issuer's bonds, leading to a classification and a regression task respectively.
Temporal aggregation of the data is applied with periods of one month, one quarter and one year. The
rating matrix is densified by removing items and users with a low number of entries. Uncertainties are
calculated by the analytical algorithm for the regression task. To validate models, k-fold cross-validation
is used.
Predicting users' buy and sell preferences towards items with the matrix factorization model, under
the assumption that the amount of a user's trades represents the user's preference to buy or sell, seems to be
unreliable.
Densifying the rating matrix makes it possible to use a higher number of factors. However, it does not affect
the prediction accuracy in a positive way, although it removes predictions with high uncertainties.
Temporal aggregation of the data with a period of one month results in a small sparse rating matrix,
on which the factorization does not perform well. Results for the one quarter and one year periods are
better. Taking the uncertainties on individual predictions calculated by the analytical algorithm into
account possibly improves the recommendations. Even though there are many overestimated errors,
the uncertainties on rating predictions are mostly lower than the predictions themselves. For the one year
period matrix, there is a bigger portion of predictions with uncertainties higher than the predictions.
Figure 5.18: NMSE for all “cleaned” matrices with ratings as “preference to trade a bond”, whose
parameters are listed in tables 5.4, 5.5 and 5.6. Some of the results for a low limit of entries l do not
include results for high d due to the violation of requirement (3.16). Solid lines show the accuracy
assessed for validation sets, dotted lines for training sets
Figure 5.19: The comparison of target ratings and their predictions in the validation sets of the cross-
validation of the optimal models. Here, ratings are the binary log of the number of transactions
Figure 5.20: Pull distribution of the normalized residuals (rij − r̂ij)/∆rij, rij ∈ Runknown. The distributions
are cropped to the range [−5, 5]. The standard deviation of the normalized residuals for the T = 1 quarter, l = 1 matrix and d = 3
model (top left) is much higher than 1, because there are some highly underestimated uncertainties. The
distributions peak around zero, which means there are many overestimated errors. The distributions
are skewed right, because the rating predictions are more often underestimated than overestimated
Figure 5.21: Distribution of the relative uncertainties ∆rij/ˆrij,rij ∈ Runknown. For T = 1 quarter, the
uncertainties on rating predictions are mostly lower than the predictions. For T = 1 year, there is a larger
portion of predictions with uncertainties higher than the predictions
Chapter 6. Conclusions and future work
While recommender systems mostly give only an overall accuracy computed on a test dataset, our approach is to
give accuracies on all of the individual predicted preferences.
An algorithm for estimating the uncertainty on individual predictions is derived for the basic matrix factorization
model and the more general factorization machine. On synthetic data the algorithm gives
reliable estimates of the uncertainties, showing that the algorithm is well understood. It was shown that
for sparse input data the uncertainties are high; however, they may still be estimated for highly sparse, large
matrices.
The algorithm is applied to real data – the transactions of emerging market bond requests for
quotes (RFQs). A rating matrix for the matrix factorization model is built with various settings. The best
results are obtained for ratings representing the preference to trade an issuer's bonds and for temporal aggregation with
periods of one quarter and one year. For these matrices, uncertainties are calculated by the analytical
algorithm. Taking the uncertainties on individual predictions calculated by the analytical algorithm into
account possibly improves the recommendations.
It is shown that the accuracy of the SGD and ALS methods depends strongly on the hyperparameters. In
contrast, MCMC determines the hyperparameters automatically and performs significantly better than
the other examined methods, both for dense and sparse matrices. However, MCMC should not be used
together with our algorithm, because the algorithm requires point estimates of the parameters to calculate the
uncertainties.
Due to time constraints, other ways of building a rating matrix from the dataset have not been checked.
Also, the power of the factorization machine to incorporate features has not been exploited; for example, a
bond's price and volume might be included in the feature matrix. In the future, other models, possibly
more suitable for financial markets data, might be found, and the algorithm for estimating the uncertainty on
individual predictions might be derived for them.
Bibliography
[1] S. Rendle, “Factorization machines with libfm,” ACM Trans. Intell. Syst. Technol., vol. 3, pp. 57:1–
57:22, May 2012.
[2] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,”
Computer, vol. 42, no. 8, pp. 30–37, 2009.
[3] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A sur-
vey of the state-of-the-art and possible extensions,” IEEE Transactions on Knowledge and Data
Engineering, vol. 17, no. 6, pp. 734–749, 2005.
[4] P. Lops, M. de Gemmis, and G. Semeraro, Recommender Systems Handbook, ch. Content-based
Recommender Systems: State of the Art and Trends, pp. 73–105. Boston, MA: Springer US, 2011.
[5] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, The Adaptive Web: Methods and Strategies
of Web Personalization, ch. Collaborative Filtering Recommender Systems, pp. 291–324. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2007.
[6] J. Bennett, C. Elkan, B. Liu, P. Smyth, and D. Tikk, “Kdd cup and workshop 2007,” SIGKDD
Explor. Newsl., vol. 9, pp. 51–52, Dec. 2007.
[7] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative
filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[8] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in 2008
Eighth IEEE International Conference on Data Mining, pp. 263–272, Dec 2008.
[9] J. L. Herlocker, J. A. Konstan, and J. Riedl, “Explaining collaborative filtering recommendations,”
in Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, CSCW
’00, (New York, NY, USA), pp. 241–250, ACM, 2000.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, “Evaluating collaborative filtering
recommender systems,” ACM Trans. Inf. Syst., vol. 22, pp. 5–53, Jan. 2004.
[11] G. Shani, L. Rokach, B. Shapira, S. Hadash, and M. Tangi, “Investigating confidence displays for
top-n recommendations,” Journal of the American Society for Information Science and Technology,
vol. 64, no. 12, pp. 2548–2563, 2013.
[12] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan, Active Learning in Recommender Systems,
pp. 809–846. Boston, MA: Springer US, 2015.
[13] A. Karatzoglou and M. Weimer, Quantile Matrix Factorization for Collaborative Filtering,
pp. 253–264. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010.
[14] A. Gunawardana and G. Shani, Evaluating Recommender Systems, pp. 265–308. Boston, MA:
Springer US, 2015.
[15] G. Takács, I. Pilászy, B. Németh, and D. Tikk, “Investigation of various matrix factorization meth-
ods for large recommender systems,” in 2008 IEEE International Conference on Data Mining
Workshops, pp. 553–562, Dec 2008.
[16] D. Agarwal and B.-C. Chen, “Regression-based latent factor models,” in Proceedings of the 15th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09,
(New York, NY, USA), pp. 19–28, ACM, 2009. 618092.
[17] S. Rendle, “Learning recommender systems with adaptive regularization,” in Proceedings of the
Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, (New York,
NY, USA), pp. 133–142, ACM, 2012.
[18] D. Zachariah, M. Sundin, M. Jansson, and S. Chatterjee, “Alternating least-squares for low-rank
matrix reconstruction,” IEEE Signal Processing Letters, vol. 19, pp. 231–234, April 2012.
[19] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, “Fast context-aware recommen-
dations with factorization machines,” in Proceedings of the 34th International ACM SIGIR Confer-
ence on Research and Development in Information Retrieval, SIGIR ’11, (New York, NY, USA),
pp. 635–644, ACM, 2011.
[20] T. Silbermann, I. Bayer, and S. Rendle, “Sample selection for mcmc-based recommender systems,”
in Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, (New York, NY,
USA), pp. 403–406, ACM, 2013.
[21] M. Li, Z. Liu, A. J. Smola, and Y.-X. Wang, “Difacto: Distributed factorization machines,” in
Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM
’16, (New York, NY, USA), pp. 377–386, ACM, 2016.
[22] Y. Dodge, The Concise Encyclopedia of Statistics, ch. Maximum Likelihood, pp. 334–336. New
York, NY: Springer New York, 2008.
[23] F. James, “Interpretation of the shape of the likelihood function around its minimum,” Computer
Physics Communications, vol. 20, no. 1, pp. 29 – 35, 1980.
[24] Hayden and D. R. Twede, “Observations on the relationship between eigenvalues, instrument noise,
and detection performance,” in Proceeding of SPIE 4816, Imaging Spectrometry VIII (S. S. Shen,
ed.), p. 355, SPIE, Nov. 2002.
[25] S. Arlot and A. Celisse, “A survey of cross-validation procedures for model selection,” Statist.
Surv., vol. 4, pp. 40–79, 2010.
Appendix A. Minimization methods test
The methods’ hyperparameters are listed in table A.1.
Table A.1: Hyperparameters
Parameter Description SGD SGDA ALS MCMC
η Learning rate + + - -
λ Regularization values + - + -
Initialization std.dev. The standard deviation for initialization + + + +
The input rating matrices for the tests are simulated as described in section 4.1 with the parameter values listed in
table A.2.
Table A.2: Parameter values
Parameter Value
n 30
m 40
d 2
ll, lr -1, 1
σdata 0.2
β 0.7
βv 0.1
Iterations 1000
Initialization std.dev. 0.1
βv is the proportion of the training dataset held out as a validation subset in the SGDA method.
Learning rate. SGD and SGDA are highly dependent on the learning rate value. Since SGDA may find
regularization values automatically, the effect of the learning rate on optimization convergence is tested
for the SGDA method. The test shows that if the learning rate η is too small (η = 0.0001, 0.001), the
algorithm converges slowly; if η is high (η = 0.2, 0.5), it may not find any minimum (figures A.1, A.2).
For further tests the learning rate is chosen as η = 0.01.
Figure A.1: Convergence of SGDA with various learning rates. Results of model training on 10 simu-
lated rating matrices of sparsity β = 0.7 (RMSE is calculated on a training dataset). SGDA depends
strongly on the learning rate η: if η is too small (η = 0.0001, 0.001), the algorithm converges slowly; if η
is high (η = 0.2, 0.5), it may not find any minimum
Figure A.2: Convergence of SGDA with various learning rates. Results of model training on 10 simu-
lated rating matrices of sparsity β = 0.7 (RMSE is calculated on a test dataset). SGDA depends
strongly on the learning rate η: if η is too small (η = 0.0001, 0.001), the algorithm converges slowly; if η
is high (η = 0.2, 0.5), it may not find any minimum
Regularization values. The goal of using regularization is to generalize the model, enabling it not only
to fit the known ratings but to predict unknown ratings as well. The regularization values are typically
found by grid search. As every parameter may have its own regularization value, the grid search has
exponential complexity. In the test, to reduce complexity, the number of regularization parameters is
reduced to one, i.e. all parameters share a single regularization value. The measure of accuracy is taken as the
root-mean-square error between the predicted ratings r̂ and the actual ratings without noise on top, r0, in a test
dataset, RMSE^R0_test:
RMSES = √( ∑rij∈S (rij − r̂ij)² / |S| ) (A.1)
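Equation (A.1) can be sketched as follows (an illustrative helper):

```python
import math

def rmse(r, r_hat):
    """Root-mean-square error between ratings r and predictions r_hat (eq. A.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, r_hat)) / len(r))
```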
There is no significant difference in the accuracy of the methods on a dense matrix (β = 0.5), although a high
value of regularization prevents SGD from finding the optimum and makes ALS more stable. On a rating
matrix of high sparsity (β = 0.8), MCMC performs significantly better than the other methods; ALS and
SGD without regularization give the worst accuracy, yet using regularization improves the accuracy.
SGDA (SGD with adaptive regularization) performs as well as SGD with a properly chosen regularization
value, which makes SGDA preferable to SGD because it does not require a grid search.
However, SGDA holds out part of the training dataset as a validation subset, which may make using SGDA
impossible for a highly sparse input. Accuracies on the train and test datasets for a sparse matrix (β = 0.8)
are presented in table A.3. The accuracy RMSE_test^{R0} is shown in figure A.3; examples of the methods'
results are given in figures A.4 and A.5.
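The single-value grid search described above can be sketched as follows; the minimal ALS trainer and the simulated train/test split are illustrative stand-ins for the thesis setup.

```python
import numpy as np

def als_train(R, mask, d=2, lam=0.1, iters=15, seed=0):
    """Minimal ALS for masked matrix factorization with a single shared
    regularization value lam (an illustrative sketch, not the thesis code)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    A = rng.standard_normal((n, d))
    B = rng.standard_normal((m, d))
    for _ in range(iters):
        for i in range(n):                         # solve for each user vector
            obs = mask[i]
            A[i] = np.linalg.solve(B[obs].T @ B[obs] + lam * np.eye(d),
                                   B[obs].T @ R[i, obs])
        for j in range(m):                         # solve for each item vector
            obs = mask[:, j]
            B[j] = np.linalg.solve(A[obs].T @ A[obs] + lam * np.eye(d),
                                   A[obs].T @ R[obs, j])
    return A, B

def rmse(R, mask, A, B):
    return np.sqrt(np.mean((R - A @ B.T)[mask] ** 2))

# low-rank ratings plus noise; observed entries split into train and test
rng = np.random.default_rng(2)
R = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 25))
R += 0.1 * rng.standard_normal(R.shape)
observed = rng.random(R.shape) < 0.5
test_mask = observed & (rng.random(R.shape) < 0.2)
train_mask = observed & ~test_mask

grid = [0.01, 0.1, 1.0, 10.0]          # one shared value, so a linear scan suffices
scores = {lam: rmse(R, test_mask, *als_train(R, train_mask, lam=lam))
          for lam in grid}
best = min(scores, key=scores.get)
print(scores, "-> best lam:", best)
```

With one shared regularization value the search is a linear scan over the grid; with a separate value per parameter group the same loop would become a product over grids, which is the exponential blow-up mentioned above.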
MCMC performs significantly better than the other examined methods for both dense and sparse matrices.

What's hot
DCFriskpaper280215
Microsoft Professional Capstone: Data Science
Automatic detection of click fraud in online advertisements
A Bilevel Optimization Approach to Machine Learning
Knustthesis
dissertation_ulrich_staudinger_commenting_enabled
Active Learning Literature Survey
Thesis
main
Mth201 COMPLETE BOOK
Non omniscience
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Machine learning solutions for transportation networks
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
biometry MTH 201
Thesis- Multibody Dynamics

Similar to Thesis_Nazarova_Final(1)
Interactive Filtering Algorithm - George Jenkins 2014
Aspect_Category_Detection_Using_SVM
Valkhof, Aart 0182737 MSc ACT
Tutorial
Masters Thesis - Ankit_Kukreja
Master_Thesis
Neural Networks on Steroids
Dm
Face recognition vendor test 2002 supplemental report
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
2012-02-17_Vojtech-Seman_Rigorous_Thesis
High Performance Traffic Sign Detection
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Big data-and-the-web
A Seminar Report On NEURAL NETWORK
Alinia_MSc_S2016
outiar.pdf
LATENT FINGERPRINT MATCHING USING AUTOMATED FINGERPRINT IDENTIFICATION SYSTEM
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
dissertation
Thesis_Nazarova_Final(1)

  • 1. A THESIS PRESENTED FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTATIONAL SCIENCE UNCERTAINTY ANALYSIS OF PREDICTIONS BY RECOMMENDER SYSTEMS BASED ON MATRIX FACTORIZATION MODELS Author: SARDANA NAZAROVA 11106514 Supervisors: DR. FABIAN JANSEN DR. DRONA KANDHAI DR. VALERIA KRZHIZHANOVSKAYA Committee members: DR. DRONA KANDHAI DR. FABIAN JANSEN DR. MICHAEL LEES DR. VALERIA KRZHIZHANOVSKAYA IOANNIS ANAGNOSTOU (MSC.) SEPTEMBER 2016
  • 2. Abstract In this work a recommender system for financial markets products based on Factorization Machines is considered. In practice, Factorization Machines are used to give predictions as point numbers without any uncertainties around them. When giving recommendations, however, it is crucial that some quan- titative indication of the accuracy of the recommendations is given, so that those who use it can assess their reliability. While recommender systems mostly provide only the overall accuracy using a validation dataset, our approach is to give accuracies on all of the individual predicted preferences. In this thesis, an algorithm that estimates the accuracy of individual predictions made by recommender systems based on matrix factorization models is created and tested. The method is furthermore applied on emerging market bonds.
  • 3. Acknowledgements I would like to express my deep gratitude to my supervisors, Drona Kandhai, Fabian Jansen, and Valeria Krzhizhanovskaya for the patient guidance, the valuable comments, remarks and help. Special thanks to my daily supervisor Fabian. His guidance, inspiration and enthusiastic encouragement made this thesis work possible. I am extremely grateful to Fabian for introducing me to the world of Data science and for giving me the opportunity to work with an awesome team of data scientists. I would like to thank the Wholesale Banking Advanced Analytics team at ING Bank for taking me as an intern and treating me as an equal. I would like to express my very great appreciation to Alexander Boukhanovsky, Michael Lees and Valeria Krzhizhanovskaya for their continuous help throughout the two years at ITMO and UvA. I owe my sincere thanks to Elena Nikolaeva and Mendsaikhan Ochirsukh for the friendship, kindness and sense of humour. Last but not least, I wish to thank my parents Anna Nazarova and Dmitriy Nazarov for their love and unconditional support. Sardana Nazarova, September 16, 2016
  • 4. Table of contents
    Introduction 3
    1 Recommender systems 5
      1.1 Approaches 5
      1.2 Explicit and implicit feedback 6
      1.3 Confidence in the recommendation 7
    2 Matrix factorization models 8
      2.1 Basic matrix factorization model 8
      2.2 Factorization Machines 10
      2.3 Optimization algorithms 11
        2.3.1 SGD 11
        2.3.2 ALS 12
        2.3.3 MCMC 12
      2.4 Software 12
    3 Uncertainties estimation 13
      3.1 Uncertainties for matrix factorization 13
        3.1.1 Precision matrix 14
        3.1.2 Covariance matrix 15
        3.1.3 Individual prediction uncertainties 17
        3.1.4 Estimation of normalization factor 17
        3.1.5 The algorithm 18
      3.2 Uncertainties for Factorization Machines 18
      3.3 Bootstrapping 19
    4 Uncertainty analysis using synthetic data 21
      4.1 Test methodology 21
      4.2 Test of various rating matrices and noise levels 25
      4.3 Test of various sparsities 26
      4.4 Estimation of a normalization factor 29
        4.4.1 Determining d 32
      4.5 Conclusions 33
    5 Calculations on real data 34
      5.1 Financial market data 34
      5.2 Rating matrix 37
      5.3 Implicit feedback 39
      5.4 Factorization 41
        5.4.1 Preference to buy or sell 41
        5.4.2 Preference to trade 43
  • 5. 5.5 Conclusions 44
    6 Conclusions and future work 47
    Bibliography 48
    A Minimization methods test 50
    B Uncertainties for attribute-aware model 58
  • 6. Introduction The Wholesale Banking Advanced Analytics team in ING has introduced a recommender system for trading financial markets products like commodities, bonds and foreign exchange products. The current recommender system is based on Factorization Machines, in particular the software package libFM [1]. However, Factorization Machines give predictions as point numbers without any uncertainties around them. Hence, ING has no real handle on the accuracy of the recommendations. A usual way of estimating accuracy is splitting data into train and test sets. However, this method is not suitable in our case. The reason is that real data on financial markets is highly sparse, so that splitting makes the data even sparser. Recommender systems generally give recommendations in order of some estimated quantity, a “preference” or a “like”. When giving recommendations, it is crucial that some quantitative indication of the accuracy of the recommendations is given, so that those who use it can assess their reliability. While recommender systems mostly give only overall accuracy using a validation dataset, our approach is to give accuracies on all of the individual predicted preferences. Without such an indication, it is difficult to compare predictions, either among themselves or with other given values. It is essential for decision makers to know how much they can rely on a prediction, as the predictions 5±4 and 5±0.2 are different. It is possible that models other than Factorization Machines are more suitable for financial markets data. However, the focus of this thesis is on matrix factorization models since they are used within ING; due to the limited time available, only basic matrix factorization models and Factorization Machines are considered. There are two optimization methods used to train factorization models: stochastic gradient descent (SGD) and alternating least squares (ALS) [2]. LibFM also has Markov Chain Monte Carlo as an option.
Since matrix factorization is a non-convex problem, these methods could converge to a local minimum. These methods and their stability are tested and compared. Overall, the research focuses on the creation and testing of an algorithm that estimates the accuracy of individual predictions made by recommender systems based on matrix factorization models. The algorithm additionally includes tools for estimating the overall noise level in input data. Problem Statement. The main research question is to formulate a measure of uncertainty for matrix factorization based recommender systems and to apply a Factorization Machine with an uncertainty estimate to financial markets products of ING. The research questions are: • Formulate a measure of uncertainty for matrix factorization based recommender systems. • Test this measure on synthetic and real data. Thesis structure. The thesis is structured as follows: the first chapter gives a brief description of the different approaches of recommender systems. The next chapter describes matrix factorization models, namely basic matrix factorization, the attribute-aware model, and, in more detail, Factorization Machines. In addition, this chapter also gives specifications for the optimization methods. Chapter 3 provides the algorithm for estimating an uncertainty on individual recommendations for the matrix factorization model and for the more general Factorization Machine. Herein we also explain a way to estimate the noise level in data. The result of testing the algorithm using simulated data is given
  • 7. in the next chapter. Chapter 5 shows the result of applying the method on financial markets data. The conclusions and recommendations for future work are presented in the last chapter.
  • 8. Chapter 1. Recommender systems With the increasing amount of available data, users face the problem of finding important data. For example, it can take a lot of time for users to find products in a huge catalogue in an internet shop. Thus recommender systems, which are able to make personalized recommendations of possibly useful items to a user, become essential to filter data ([3], [4], [5], [2]). Automatically recommending items to users has been of academic interest over the last 20 years, since recommender systems became an independent research area in the early 1990s. There are multiple competitions and conferences on recommender systems. The leading competition is the KDD Cup, a data-mining competition [6]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Recommender systems are not only of academic interest but essential for industry. The most prominent example is Netflix, which had a huge impact on the field with their competition to improve their recommender technology, i.e. the Netflix Prize. The Netflix Prize competition has stimulated a great deal of high quality research. For example, the KDD Cup’07 focused on predicting aspects of movie rating behaviour, employing the Netflix dataset [6]. The recommendation problem. A recommender system’s core function is to identify useful items for the user. In order to do this a recommender system must be able to predict the utility of items, and then decide which items are worth recommending [3]. “Preferences” or “likes” are predicted using data of users, items, the past history of user preferences and other data. Commonly stated, the recommendation problem is estimating unknown “ratings”, which are the preferences or likes expressed by users for items. General procedures for estimating ratings are described in [3]. A rating is represented as a user, item, rating triple (ci, sj, rij).
A user is defined with a profile ci that includes user attributes, such as gender, age, etc. It can also be just a unique identifier. The same is true for an item sj. Then the general rating estimation procedure can be defined as ˆrij = uij({R, c, s}), where R is the set of known ratings {rij ≠ ∅}, c are the user profiles, s the item profiles, and uij the utility function. 1.1 Approaches Depending on the utility function there are three approaches in recommender systems: content-based, collaborative filtering, and hybrid [3]. The content-based approach recommends items to a user that are similar to the ones the user has rated before. Oppositely, collaborative filtering relies only on the similarity in the past behaviour of the users. Hybrid recommenders combine collaborative and content-based approaches. Content-based systems calculate the similarity between an item and items that the user previously rated, and recommend the best-matching ones. For that purpose the system uses item profiles. User preferences are built on the basis of items previously rated by the user. Once user and item profiles are available, similarities can be calculated using, for example, the cosine similarity measure: uij = cos(ci, sj) = (ci · sj) / (||ci||2 · ||sj||2)
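The cosine utility above can be sketched in a few lines (the keyword vectors are invented for illustration):

```python
import numpy as np

def cosine_utility(c_i, s_j):
    """u_ij = cos(c_i, s_j): content-based utility of item j for user i."""
    return c_i @ s_j / (np.linalg.norm(c_i) * np.linalg.norm(s_j))

# toy keyword weights: one user profile vs. two document profiles
user = np.array([1.0, 0.0, 2.0])        # importance of three keywords to the user
doc_a = np.array([2.0, 0.0, 4.0])       # same orientation as the user profile
doc_b = np.array([0.0, 3.0, 0.0])       # disjoint keywords
print(cosine_utility(user, doc_a))      # close to 1.0: strong match
print(cosine_utility(user, doc_b))      # 0.0: orthogonal profiles, no match
```

Ranking items by this utility and recommending the best-matching ones is exactly the content-based keyword-analysis scheme described above.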
  • 9. As an example, in keyword analysis sj can be some weights of keywords of a document j and ci the importance of these words to user i. As a result the user will get recommendations of documents from his field of interest. Other machine learning techniques such as decision trees and artificial neural networks are used as well. For examples of content-based methods see [4], which provides a thorough discussion of a wide variety of topics and methods. However, content-based strategies suffer from limited content analysis (they require external information that might be hard to collect or is unavailable, and two different items with the same set of features are not distinguishable) and over-specialization (recommendations tend to be limited to items similar to those already rated). Collaborative filtering bases its predictions on ratings by other users. There are two families of collaborative filtering algorithms: memory-based and model-based [5]. Memory-based (nearest neighbor) methods rely on relations between items or between users. User-oriented ones estimate a user’s rating for an item based on all the ratings of “neighboring” users for that item. Finding “neighboring” (like-minded) users is done by evaluating the similarity (correlation, cosine, their modifications, or other measures) between two users on their ratings of items that both users have rated [3]. The same methods are used for calculating item similarities in item-oriented methods. As described in [4], memory-based methods suffer from some disadvantages. The main drawback of user-oriented methods is that they do not perform well in case of high sparsity of the ratings R, where there are few coratings, since the similarity measure needs sets of items rated by both users. Another disadvantage is the expensive, time- and memory-consuming calculation of user/item neighborhoods, as it requires comparison against all other users/items.
There are several techniques, such as clustering and subsampling, used to reduce time or memory consumption in user-oriented methods. For item-oriented methods the consumption is reduced by storing only the top n correlations for each item and calculating correlations only for item pairs with more than k coratings, though this reduces the accuracy of the predictions. Expensive similarity calculations can also be done offline, as in the Amazon recommender [7]. The problem of large sparse ratings data with few coratings may be solved with dimensionality reduction algorithms, mapping the underlying data to a latent space of smaller dimensionality [4]. The best-known dimensionality reduction algorithm is matrix factorization, which belongs to the model-based methods. Model-based methods use the collection of ratings to learn a model from the data using statistical and machine learning techniques and then apply this model to get predictions. 1.2 Explicit and implicit feedback Recommender systems use different types of data, categorized into explicit feedback and implicit feedback. The first uses explicitly given user preferences showing how much a user likes or dislikes an item on some rating scale. For example, the Netflix Prize data are ratings on a scale from 1 (“totally dislike”) to 5 (“very like”) [6]. Explicit feedback is preferable, since it represents user preferences fully. However, it is not always possible to collect explicit feedback. When this is the case, user preferences may be deduced indirectly by the recommender system from user actions, for instance purchase history. Unfortunately, implicit feedback has characteristics that may prevent the direct use of algorithms designed for working with explicit feedback [8]. The numerical value of implicit feedback describes the frequency of actions, indicating confidence.
Recurring events are more likely to reflect user opinion; thus, by observing a user’s behaviour it may be inferred which items that user probably likes, yet it is only a guess whether the user actually likes the item. The main problem is that it is mostly impossible to reliably infer which items a user did not like. The fact that a user has no relation with an item usually means the user has not known about that item; hence implicit feedback does not represent negative preferences. Preferences rui of user u towards item i may be derived from implicit feedback assuming that high values of observations mean stronger preference. For example, rui can indicate the number of times user u purchased item i. There are more complicated ways to get preferences. For instance, Hu, Koren and
  • 10. Volinsky [8] build preferences from user behaviour using confidence. They introduced two sets of variables indicating preference and the confidence level in observing that preference. A set of variables pui shows the preference of user u for item i, i.e. if u consumed i, it means u likes i (pui = 1), otherwise there is no preference (pui = 0): pui = 1 if rui > 0, and pui = 0 if rui = 0. These preferences have different confidence levels. Usually a higher rui (more frequent) means a stronger indication that the user likes the item. For a measure of confidence in observing pui two equations are proposed: cui = 1 + αrui and cui = 1 + α log(1 + rui/ε). The cost function for the minimization procedure is changed to min ∑u,i cui (pui − ˆpui)^2. Financial markets data does not give explicit feedback, and ratings should be derived from implicit feedback. 1.3 Confidence in the recommendation It was shown in [9] that providing an explanation with the recommendation may influence users to make the right decision. Explanations provide a mechanism for handling errors that come with a recommendation. One such explanation is the confidence in the recommendation: the system’s trust in its recommendations or predictions [10]. Presenting confidence in predictions can provide valuable information to users in making their decisions. Depending on the algorithm used in a recommender and the task, the confidence in the prediction may be obtained differently [11]. In general, collaborative filtering algorithms become more confident in predictions for a user when there are more known ratings by that user, and the same holds for items. Hence, a confidence measure can simply be associated with the amount of information about a user and an item in the dataset. It can be interpreted as the recommendation strength: strong or weak, when the system is confident or unsure, respectively, whether the item is appropriate for the recommendation.
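The preference/confidence construction above can be sketched as follows (the interaction counts and the choice α = 40 are purely illustrative):

```python
import numpy as np

alpha = 40.0                      # confidence scaling; the value is illustrative
r = np.array([[0, 3, 0],          # r_ui: e.g. times user u purchased item i
              [1, 0, 7]])

p = (r > 0).astype(float)         # preference p_ui: 1 if any interaction, else 0
c = 1.0 + alpha * r               # confidence c_ui = 1 + alpha * r_ui

# weighted squared error of a candidate prediction p_hat, matching the
# modified cost  sum_{u,i} c_ui (p_ui - p_hat_ui)^2
p_hat = np.full_like(p, 0.5)
cost = np.sum(c * (p - p_hat) ** 2)
print(p)
print(c)
print(cost)
```

Frequently repeated interactions get a large weight c_ui, so the fit is pulled hardest towards preferences the data supports most strongly, while unobserved pairs still contribute with the minimal confidence of 1.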
The most common measurement of confidence is the confidence interval. The probability distribution of ratings for an item is obtained differently in memory-based and model-based methods [12]. In memory-based methods, the rating’s distribution is obtained from the data. As an example, in nearest-neighbor techniques the rating’s distribution is obtained based on the similarity of the k nearest users to the user. For model-based methods it is possible to obtain the rating’s distribution from the model itself using uncertainty quantification techniques. Matrix factorization models do not provide any information on the uncertainty and the confidence of the recommendation [13]. Karatzoglou and Weimer introduced an algorithm that estimates the conditional quantiles of the ratings in matrix factorization using quantile regression. It belongs to the intrusive uncertainty quantification methods. This thesis focuses on non-intrusive uncertainty quantification methods, since intrusive methods require reformulating the mathematical models. Measuring the quality of confidence is difficult. In [14] two possible ways of evaluating confidence bounds are given. If a recommender is trained over a confidence threshold α, for instance 95%, and produces, along with the predicted rating, a confidence interval, the true confidence can be computed as αtrue = n+/(n− + n+), where n+ and n− are the numbers of times that predicted ratings were within and outside the confidence interval respectively. The true confidence should be close to the requested confidence α. Another way is to filter out recommended items where the confidence in the predicted rating is below some threshold. Then the prediction accuracy can be estimated for different filtering thresholds. Recommendations made with high confidence should, in general, be more accurate than those made with lower confidence.
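The true-confidence check αtrue = n+/(n− + n+) is easy to sketch (the held-out ratings and intervals below are invented):

```python
def true_confidence(ratings, lowers, uppers):
    """alpha_true = n_plus / (n_minus + n_plus): fraction of held-out ratings
    falling inside their predicted confidence intervals."""
    n_plus = sum(lo <= r <= up for r, lo, up in zip(ratings, lowers, uppers))
    return n_plus / len(ratings)

# toy held-out ratings with intervals produced by some hypothetical recommender
ratings = [4.1, 3.0, 2.2, 5.0, 1.9]
lowers  = [3.5, 2.8, 2.5, 4.0, 1.0]
uppers  = [4.5, 3.4, 3.1, 5.5, 2.5]
print(true_confidence(ratings, lowers, uppers))   # 4 of 5 ratings fall inside
```

A well-calibrated recommender trained for α = 0.95 should report a true confidence close to 0.95 on held-out data; large deviations in either direction indicate over- or under-confident intervals.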
Chapter 2. Matrix factorization models

Matrix factorization models try to model ratings by characterizing users and items by latent factors. The matrix factorization models considered in this work are a basic matrix factorization, an attribute-aware model and Factorization Machines. The basic one uses only ratings data and hence represents the model-based collaborative-filtering approach; an attribute-aware model also incorporates content data and can be considered a hybrid approach; the Factorization Machine is able to generalize most state-of-the-art matrix factorization models.

2.1 Basic matrix factorization model

In recommender systems based on collaborative filtering, the input data can be represented as a matrix of ratings R ∈ R^{n×m}:

R = [ r_11  r_12  ···  r_1m
      r_21  r_22  ···  r_2m
      ···   ···   ···  ···
      r_n1  r_n2  ···  r_nm ],

where n is the number of users, m is the number of items, and r_ij represents the rating given by user i to item j. Usually R is a sparse matrix, since users are likely to rate only a small number of items compared to the total number of items available. In matrix factorization models it is assumed that the matrix of ratings can be represented by a small number of latent factors. This allows matrix factorization to estimate factors even in highly sparse data. Matrix factorization models map both users and items to a joint latent factor space of dimensionality d, such that user-item interactions are modeled as inner products in that space [2]:

R_{n×m} ≈ A_{n×d} · B^T_{d×m}   (2.1)

Each user i is associated with a vector a_i ∈ R^d, which may be considered as indications of how much the user prefers each of the d latent factors, and each item j is associated with a vector b_j ∈ R^d representing the latent factors of the item (see an example in figure 2.1).

Figure 2.1: Matrix factorization example

Hence, each user-item interaction, including
unknowns, can be approximated as:

r̂_ij = a_i · b_j   (2.2)

As some users tend to give higher or lower ratings than the average, and some items receive on average higher or lower ratings, biases should also be included in the model. The biased matrix factorization model is

r̂_ij = w_i + w_j + a_i · b_j,   (2.3)

where w_i denotes the bias of the user and w_j that of the item. However, it was shown in [15] that separate bias terms are not necessary, because they can be incorporated directly into the factorization. The proposed BRISMF model (Biased Regularized Incremental Simultaneous Matrix Factorization) incorporates the bias terms by fixing the (d+1)-th column of A and the (d+2)-th row of B^T to the constant value 1. Then (2.3) becomes

r̂_ij = w_i + w_j + Σ_{f=1}^{d} a_i^f b_j^f = Σ_{f=1}^{d+2} a_i^f b_j^f,

where a_i^{d+1} = 1, b_j^{d+1} = w_j, b_j^{d+2} = 1, a_i^{d+2} = w_i.

Generalization to an attribute-aware model. Treating every user as a combination of user attributes, a_i^f = Σ_{s=1}^{n} u_i^s a_s^f, and every item as a combination of item attributes, b_j^f = Σ_{q=1}^{m} p_j^q b_q^f, generalizes the model (2.2) into an attribute-aware model [16]. The latent-factor part (without biases or unary interactions) of the attribute-aware model is

r̂_ij = Σ_{f=1}^{d} ( Σ_{s=1}^{n} u_i^s a_s^f ) · ( Σ_{q=1}^{m} p_j^q b_q^f ),   (2.4)

where u_i ∈ R^{1×n}, p_j ∈ R^{1×m} are the attribute vectors of user i and item j, a^f ∈ R^{n×1}, b^f ∈ R^{m×1} are latent factors, and d is the number of latent dimensions. In matrix notation this is

R̂ = (UA)·(PB)^T   (2.5)

The model (2.5) reduces to (2.2) if U ∈ R^{n×n} and P ∈ R^{m×m} are identity matrices, with the sizes n, m equal to the number of users and items respectively: UA = I_n · A = A. In this representation, for user i, being user i is itself the attribute.
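The BRISMF bias-absorption trick can be verified numerically. The following is a small illustrative sketch (all sizes and variable names are made up, not the thesis implementation):

```python
import numpy as np

# Sketch of eq. (2.3): a biased prediction w_i + w_j + a_i . b_j, and the BRISMF
# trick of absorbing both biases into two extra latent dimensions.
rng = np.random.default_rng(1)
n, m, d = 4, 5, 3
A, B = rng.normal(size=(n, d)), rng.normal(size=(m, d))
w_user, w_item = rng.normal(size=n), rng.normal(size=m)

biased = w_user[:, None] + w_item[None, :] + A @ B.T          # eq. (2.3)

# Augment: column d+1 of A fixed to 1 (paired with the item bias), column d+2 of B
# fixed to 1 (paired with the user bias).
A_aug = np.hstack([A, np.ones((n, 1)), w_user[:, None]])
B_aug = np.hstack([B, w_item[:, None], np.ones((m, 1))])
absorbed = A_aug @ B_aug.T                                    # pure factorization form

print(np.allclose(biased, absorbed))                          # True: biases absorbed
```

Row i of A_aug is [a_i, 1, w_i] and row j of B_aug is [b_j, w_j, 1], so their inner product is a_i·b_j + w_j + w_i, exactly eq. (2.3).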
Generalization to a Factorization Machine. The difference between an attribute-aware model and the more general Factorization Machine is that the latter contains additional interactions between user attributes (e.g., between age and salary) and between item attributes [1]:

r̂_ij = Σ_f [ (Σ_{s=1}^{n} u_i^s a_s^f)·(Σ_{q=1}^{m} p_j^q b_q^f)                (attribute-aware model)
            + Σ_{s=1}^{n} Σ_{s'=s+1}^{n} u_i^s a_s^f · u_i^{s'} a_{s'}^f         (user attribute interactions)
            + Σ_{q=1}^{m} Σ_{q'=q+1}^{m} p_j^q b_q^f · p_j^{q'} b_{q'}^f ]       (item attribute interactions)   (2.6)
If the user and item attributes are combined into one vector x_ij = [u_i, p_j], the above equation becomes

r̂_ij = ŷ(x_ij) = Σ_{s=1}^{k} Σ_{s'=s+1}^{k} x_s x_{s'} Σ_{f=1}^{d} v_s^f v_{s'}^f,   (2.7)

where k = n + m and v = [a, b].

2.2 Factorization Machines

The Factorization Machine is a general approach able to mimic most factorization models through feature engineering [1]. It models interactions between all input variables using factorized interaction parameters.

Figure 2.2: Example (from [1]) of representing a recommender problem with a design matrix X and a vector of targets y. Every row represents a feature vector x_i with its corresponding target y_i

The input data for the Factorization Machine is described by a design matrix X ∈ R^{n×k} and a vector of targets y ∈ R^n (figure 2.2). Each row x_i ∈ R^k of X is a vector of real-valued features describing one case with k variables, and y_i is the prediction target of that case. The variables in X may be binary indicators, user and item attributes, time indicators or any real-valued features. The model is

ŷ(x) = w_0 + Σ_{s=1}^{k} w_s x_s + Σ_{s=1}^{k} Σ_{s'=s+1}^{k} x_s x_{s'} Σ_{f=1}^{d} v_s^f v_{s'}^f,

where d is the dimensionality of the factorization and the model parameters are w_0 ∈ R, w ∈ R^k, V ∈ R^{k×d}. The first part of the model consists of an overall bias and the unary interactions of each input variable with the target (as in a linear regression model). The second part contains all pairwise interactions of the input variables, with factorized weights that are assumed to have low rank. The Factorization Machine model is highly adjustable: one can choose whether to include the bias and/or the unary interactions and the number of dimensions of the factorization, and further adjustment is possible by specifying the design matrix X differently.
For example, the basic matrix factorization model R̂ = A·B^T can be obtained by omitting the bias and unary interactions and constructing x ∈ R^{|A|+|B|} from binary indicator variables: for every known r_ij, the feature vector is all zeros except for the user indicator at position i and the item indicator at position |A| + j, which are set to 1. The model is then:

r̂_ij = ŷ(x_ij) = Σ_{f=1}^{d} v_i^f v_j^f

With the bias and unary interactions this model becomes the biased matrix factorization model (see eq. 2.3):

r̂_ij = ŷ(x_ij) = w_0 + w_i + w_j + Σ_{f=1}^{d} v_i^f v_j^f
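A minimal sketch of the second-order Factorization Machine prediction may help here. It uses the standard O(k·d) reformulation of the pairwise term (the same identity is used again in section 3.2); all names and sizes are illustrative.

```python
import numpy as np

# Sketch of the FM prediction: bias + unary terms + factorized pairwise
# interactions, via the identity
#   sum_{s<s'} x_s x_s' <v_s, v_s'> = 1/2 sum_f [(sum_s v_sf x_s)^2 - sum_s v_sf^2 x_s^2].
def fm_predict(x, w0, w, V):
    linear = w0 + w @ x
    sq_of_sum = (V.T @ x) ** 2            # (sum_s v_sf x_s)^2 per factor f
    sum_of_sq = (V ** 2).T @ (x ** 2)     # sum_s v_sf^2 x_s^2 per factor f
    return linear + 0.5 * (sq_of_sum - sum_of_sq).sum()

# Brute-force check against the explicit double sum over s < s'.
rng = np.random.default_rng(2)
k, d = 6, 3
x, w0, w, V = rng.normal(size=k), 0.1, rng.normal(size=k), rng.normal(size=(k, d))
brute = w0 + w @ x + sum(x[s] * x[t] * (V[s] @ V[t])
                         for s in range(k) for t in range(s + 1, k))
print(np.isclose(fm_predict(x, w0, w, V), brute))   # True
```

The reformulation is what makes FM prediction linear in k and d instead of quadratic in k.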
Though the bias terms may improve accuracy, they will not be included in the further models used for testing and uncertainty calculations, as they may be included in the factorization part itself.

2.3 Optimization algorithms

Due to the high sparsity of the input data it is challenging to learn the model parameters. The state-of-the-art models are learned only on the observed cases. Addressing only a few known cases is prone to overfitting, but this may be avoided by regularization (see the effect of regularization in table A.3). The model is learned by minimizing the cost function, the regularized sum of the losses l over the observed data S:

L(S) = argmin_Θ Σ_{(x,y)∈S} l(ŷ(x|Θ), y) + Σ_{θ∈Θ} λ_θ θ²,   (2.8)

where Θ are the model parameters and ŷ(x|Θ) is a prediction depending on the chosen Θ. Here L2 regularization is applied. A loss function represents the price paid for an inaccuracy of predictions. In Factorization Machines a least-squares loss function (L2 norm) is chosen for a regression task:

l(ŷ(x|Θ), y) = (ŷ(x|Θ) − y)²,

and for a binary classification task y ∈ {−1, 1} it is a logistic function:

l(ŷ(x|Θ), y) = ln(1 + exp(−ŷ(x|Θ)·y)).

The learning algorithms that Factorization Machines use are based on stochastic gradient descent (SGD), alternating least squares (ALS), and Bayesian inference using Markov Chain Monte Carlo (MCMC).

2.3.1 SGD

The stochastic gradient descent algorithm is popular for optimizing factorization models; it is simple and computationally light. The algorithm iterates over the known cases (x,y) ∈ S and updates the model parameters θ ∈ Θ in the direction opposite to the loss function gradient:

θ ← θ − η ( ∂l(ŷ(x|Θ), y)/∂θ + 2λ_θ θ ),   (2.9)

where η is the learning rate and λ_θ are the regularization values. The prediction quality depends strongly on the regularization values (figure A.3). These are usually found by time-consuming grid searches that require learning the model parameters multiple times. To overcome this issue Rendle [17] introduced SGD with adaptive regularization (SGDA).
This method adapts the regularization values automatically while training the model. SGDA alternates two steps: the first updates the model parameters on a train set using (2.9), the second updates the regularization values on a validation set:

λ ← λ − α ∂l(ŷ(x|λ), y)/∂λ

However, the main shortcoming of SGD and SGDA compared to more complex algorithms is that they depend strongly on the learning step η. Though it is proved that gradient descent converges with infinitesimal steps for low-rank matrix approximation of rank higher than one, if η is too small the algorithm converges slowly, and if η is too high it may not find any minimum (figures A.1, A.2).
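The SGD update (2.9) can be sketched for the basic factorization r_ij ≈ a_i·b_j with squared loss. All hyperparameter values below are illustrative, not tuned values from the thesis:

```python
import numpy as np

# Sketch of the SGD update (2.9) for r_ij ~ a_i . b_j with squared loss and L2
# regularization, on a small synthetic rank-d rating matrix.
rng = np.random.default_rng(3)
n, m, d = 20, 25, 2
A_true, B_true = rng.normal(size=(n, d)), rng.normal(size=(m, d))
R = A_true @ B_true.T
obs = [(i, j) for i in range(n) for j in range(m) if rng.random() < 0.5]  # ~50% known

A = rng.normal(scale=0.1, size=(n, d))        # small random initialization
B = rng.normal(scale=0.1, size=(m, d))
eta, lam = 0.02, 1e-4                         # illustrative learning rate / regularization
for _ in range(300):                          # epochs over the observed cases
    for i, j in obs:
        err = R[i, j] - A[i] @ B[j]           # residual for this observed case
        grad_a, grad_b = -err * B[j], -err * A[i]   # d(loss)/d(a_i), d(loss)/d(b_j)
        A[i] -= eta * (grad_a + 2 * lam * A[i])     # update rule (2.9)
        B[j] -= eta * (grad_b + 2 * lam * B[j])

rmse = np.sqrt(np.mean([(R[i, j] - A[i] @ B[j]) ** 2 for i, j in obs]))
print(f"train RMSE: {rmse:.4f}")
```

On this noiseless rank-2 example the training RMSE drops to a small value; with a larger η the same loop can oscillate and fail to converge, illustrating the sensitivity discussed above.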
2.3.2 ALS

It is shown in [18] that alternating least squares works well for low-rank matrix reconstruction. ALS iterates over all parameters, minimizing the loss per model parameter until convergence. The optimal value of one parameter can be calculated directly if all remaining parameters are fixed, since eq. (2.8) then becomes a linear least-squares problem in that parameter. While in general an ALS iteration has a higher runtime complexity, the implementation in Factorization Machines computes it with the same complexity as an SGD iteration thanks to caching [19]. An advantage of ALS over SGD is that ALS does not use a learning rate.

2.3.3 MCMC

Both the ALS and SGD algorithms learn optimal parameters Θ, which are used for getting a point estimate of ŷ(x|Θ). Markov Chain Monte Carlo generates the distribution of ŷ by sampling. ALS can be considered as simplified MCMC without sampling: while MCMC samples parameters from the posterior distribution θ ∼ N(µ*_θ, σ*_θ²), ALS uses the expected value θ = µ*_θ. MCMC inference is simpler to apply than the other methods, since it has fewer parameters to adjust. It automatically determines the regularization parameters Θ_H by placing a prior distribution on them. The algorithm iteratively draws samples for the hyperparameters Θ_H and the model parameters Θ. The advantage of MCMC over SGD and ALS is that MCMC takes the uncertainty in the model parameters into account. The disadvantage is that MCMC has to store all generated models, which are used to calculate predictions. For large-scale models like the Factorization Machine, saving the whole chain of models is very storage intensive [20]. libFM's implementation of MCMC does not store the set of models, but sums up the predictions at every step and at the end returns their average as the result. That is why, to get reliable predictions, MCMC needs many more iterations than ALS. Tests of the listed methods are shown in appendix A.
SGD methods depend strongly on the learning step η: if η is too small, the algorithm converges slowly; if η is too high, it may not find any minimum. For low sparsity, a high value of the regularization parameter λ prevents SGD from finding the optimum and makes ALS more stable. For high sparsity, using regularization increases accuracy. SGDA performs as well as SGD with a properly chosen regularization value. MCMC performs significantly better than the other examined methods, both for dense and sparse matrices.

2.4 Software

There are several implementations of Factorization Machines. The most used one is Steffen Rendle's library libFM [1]. libFM is written in C++, and there is a Python wrapper for libFM, pywFM, which provides the full functionality of libFM. There are other Python implementations of Factorization Machines: pyFM and fastFM. It is possible that fastFM works faster than libFM, since the performance-critical code in fastFM is written in C and wrapped with Cython; however, there is no literature on the performance and reliability of pyFM and fastFM. Factorization Machines are computationally expensive to scale to large amounts of data and large numbers of features. These problems are addressed by a distributed implementation of Factorization Machines, DiFacto [21]. When using multiple workers, DiFacto converges significantly faster than libFM. In this work libFM is used, since it is the most reliable and has a Python wrapper.
Chapter 3. Uncertainties estimation

The factorization R̂ = A · B^T (2.1) gives an estimate of all ratings, but it can be quite inaccurate due to uncertainties. For example, an estimate can be r_ij = 4 ± 0.1 or r_ij = 4 ± 3; the former is much more reliable. Not only should an estimate be provided, but also the uncertainty on that estimate.

The first section of this chapter focuses on an analytical algorithm to estimate uncertainties on individual predictions for the basic matrix factorization model described in section 2.1, which is a particular case of Factorization Machines. The equations are then generalized to Factorization Machines (section 3.2). In the analytical algorithm, based on the Maximum likelihood estimator for the parameter variance, the uncertainties on the parameters are determined first; then, using a linear approximation, the uncertainties on the individual predictions are calculated.

Another method of estimating uncertainties is the Bootstrapping method (section 3.3). In the Bootstrapping method the uncertainties can be estimated directly, without computing parameter variances and without linearization.

3.1 Uncertainties for matrix factorization

The Maximum likelihood method is a method of estimating unknown parameters by the values that maximize the probability of obtaining the observed sample [22]. The likelihood is the value of a probability density function evaluated at the measured values of the observables:

L(Θ) = f(r ≡ r_known | Θ),   (3.1)

where Θ is a set of parameters. The joint density function for independent and identically distributed samples is f(r ≡ r_known|Θ) = Π_ij f(r_ij|Θ), therefore L(Θ) = Π_ij f(r_ij|Θ) for all r_ij ∈ r_known. For convenience, it is better to work with the log-likelihood:

ln L(Θ) = Σ_ij ln f(r_ij|Θ).
The function L can be maximized by setting the first partial derivative with respect to each θ ∈ Θ to zero and solving the resulting equations:

∂ ln L(Θ)/∂θ |_{θ=θ̂} = 0

From the Central limit theorem it can be assumed that the error on r_ij is Gaussian, and therefore each data point is described by a Gaussian probability function. Using a Gaussian probability density function, ln L(Θ) can be written as:

ln L(Θ) = −(1/2) Σ_ij ((r_ij − r̂_ij(Θ))/σ)² = −(1/2) χ²   (3.2)

χ² = Σ_ij ((r_ij − r̂_ij(Θ))/σ)²,  with χ²_min = RSS_min/σ²,   (3.3)

where RSS_min = Σ_ij (r_ij − r̂_ij(Θ̂))², for all r_ij ∈ r_known. In this case maximizing L(Θ) is equivalent to minimizing the chi-squared value χ². Assuming that
the log-likelihood function is a parabola around the minimum, the parameter variances can be calculated in closed form. The error on a parameter is defined as the change of the parameter which produces a change of the χ² value equal to 1 [23] (figure 3.1).

Figure 3.1: Parabolic error

Then the uncertainties on the individual recommendations are calculated with error propagation (section 3.1.3). The Maximum likelihood estimator for the parameter variance is estimated from the inverse of the second derivative of −ln L(θ) at the minimum, i.e. at θ = θ̂:

E^{-1} = (−∂² ln L/∂θ²)^{-1} = 2 (∂²χ²/∂θ²)^{-1}   (3.4)

3.1.1 Precision matrix

A second-order Taylor expansion of χ² around the minimum is:

χ²(Θ) ≈ χ²(Θ̂) + (1/2) Σ_i Σ_j (∂²χ²/∂θ_i∂θ_j)|_Θ̂ (θ_i − θ̂_i)(θ_j − θ̂_j)

In matrix notation,

χ²(Θ) ≈ χ²(Θ̂) + (Θ − Θ̂)^T · E · (Θ − Θ̂),  with E_ij = (1/2)(∂²χ²/∂θ_i∂θ_j)|_Θ̂.

E is the precision (inverse covariance) matrix. For one latent factor, the non-zero elements of the precision matrix are:

(1/2) ∂²χ²/∂a_l² = −(1/σ²) ∂( Σ_{j=1}^{m} (r_lj − a_l b_j) b_j )/∂a_l = (1/σ²) Σ_{j=1}^{m} b_j²,  ∀ r_lj ∈ r_known

(1/2) ∂²χ²/∂b_k² = −(1/σ²) ∂( Σ_{i=1}^{n} (r_ik − a_i b_k) a_i )/∂b_k = (1/σ²) Σ_{i=1}^{n} a_i²,  ∀ r_ik ∈ r_known

(1/2) ∂²χ²/∂a_l∂b_k = −(1/σ²) ∂( Σ_{j=1}^{m} (r_lj − a_l b_j) b_j )/∂b_k = (1/σ²)(2 a_l b_k − r_lk) ≈ (1/σ²) a_l b_k

In the more general case of d factors, the non-zero elements of E are:

E_{a_l^f, a_l^g} = (1/2) ∂²χ²/∂a_l^f ∂a_l^g = −(1/σ²) ∂( Σ_{j=1}^{m} (r_lj − Σ_{p=1}^{d} a_l^p b_j^p) b_j^f )/∂a_l^g = (1/σ²) Σ_{j=1}^{m} b_j^f b_j^g   (3.5)

E_{b_k^f, b_k^g} = −(1/σ²) ∂² ln L/∂b_k^f ∂b_k^g = (1/σ²) Σ_{i=1}^{n} a_i^f a_i^g   (3.6)

E_{a_l^f, b_k^g} = −(1/σ²) ∂² ln L/∂a_l^f ∂b_k^g = (1/σ²) a_l^g b_k^f   (3.7)

The indices of E indicate the elements of E corresponding to their variables. σ is an unknown normalization factor (section 3.1.4); the parameter variances will be proportional to this factor [23]. The precision (inverse covariance) matrix E ∈ R^{(n+m)d × (n+m)d} is constructed using (3.5), (3.6) and (3.7). The precision matrix is a symmetric positive semi-definite matrix of the second-order partial derivatives of the multivariate function −ln L at the solution point. It is a measure of the uncertainty of the least-squares problem through its relationship to the inverse error covariance.

3.1.2 Covariance matrix

The covariance matrix is the inverse of the precision matrix of the previous section:

⟨∆θ_i ∆θ_j⟩ = E^{-1},

where the ∆θ are the deviations of the parameters θ ∈ Θ. The parameter variances are the diagonal elements of the covariance matrix, and the uncertainties are the square roots of the variances:

∆θ_i = sqrt( [E^{-1}]_ii )   (3.8)

Unfortunately, in this case the precision matrix is singular because the parameters are not independent.

Parameters are not independent. In general, for the matrix factorization model, the rank of the inverse covariance matrix E ∈ R^{(n+m)d × (n+m)d} is equal to

(n + m)·d − d²,   (3.9)

where d is the number of latent factor dimensions. There are d² non-independent parameters. This can be proved as follows. Let R ∈ R^{n×m} be a rating matrix factorized into the users' and items' latent factor matrices A ∈ R^{n×d} and B ∈ R^{m×d} (eq. 2.1), so R = A·B^T. Then for an arbitrary nonsingular matrix C ∈ R^{d×d}:

R = A·B^T = A C C^{-1} B^T = (AC)(C^{-1}B^T).

Since C has d² elements, there are d² parameters which can be manipulated without changing the product A·B^T.
The most trivial case showing the non-independent variables is that in R = A·B^T, dividing A by any non-zero real number k and multiplying B^T by the same number k does not change the product: (A/k)·(kB^T) = A·B^T (see the example in figure 3.2).

Solution. To solve the problem, the dependent parameters should be fixed. The proposed method for obtaining the pseudo-inverse E^+ is to eliminate the zero eigenvalues in the eigendecomposition of the matrix. As the eigendecomposition is a particular case of the SVD for a positive semi-definite normal matrix, the pseudo-inverse may be calculated as the Moore-Penrose pseudo-inverse of the matrix using its singular value decomposition (SVD). If E = UΣV^T is the SVD of E, then the pseudo-inverse of E is E^+ = UΣ^+V^T, where Σ^+ is the diagonal matrix consisting of the reciprocals of E's non-zero singular values, followed by zeros.
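This pseudo-inversion can be sketched directly via the eigendecomposition, assuming a symmetric positive semi-definite precision matrix whose rank k is known; names and the toy matrix are illustrative:

```python
import numpy as np

# Sketch: invert a singular, symmetric PSD precision matrix by keeping the
# k = rank(E) largest eigenvalues and setting the inverses of the rest to zero
# (comparing eigenvalues to 0 exactly would fail with floating point).
def truncated_pseudo_inverse(E, rank):
    eigvals, Q = np.linalg.eigh(E)            # E = Q diag(eigvals) Q^T, ascending order
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues decreasingly
    eigvals, Q = eigvals[order], Q[:, order]
    inv = np.zeros_like(eigvals)
    inv[:rank] = 1.0 / eigvals[:rank]         # keep the first k, truncate the rest
    return Q @ np.diag(inv) @ Q.T

# Rank-deficient toy example: a 3x3 PSD matrix of rank 2.
M = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
E = M @ M.T                                   # one zero eigenvalue
E_pinv = truncated_pseudo_inverse(E, rank=2)
print(np.allclose(E_pinv, np.linalg.pinv(E))) # agrees with the Moore-Penrose pseudo-inverse
```

For a symmetric PSD matrix the eigendecomposition coincides with the SVD, so this reproduces the Σ⁺-based pseudo-inverse described above.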
Figure 3.2: Example of dependent parameters

Eigendecomposition of a matrix. If E is a symmetric (n×n) matrix with n linearly independent eigenvectors, then E can be factorized as

E = QΛQ^{-1},   (3.10)

where Q is the square matrix whose i-th column is the eigenvector q_i of E, and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, Λ_ii = λ_i. The eigenvalues are non-negative because E is positive semi-definite. The inverse of E is then given by E^{-1} = QΛ^{-1}Q^{-1}, where Λ^{-1}_ii = λ_i^{-1}.

Zero eigenvalues truncation scheme. As there are d² zero eigenvalues, E is not invertible in the mathematical sense. To fix the dependent parameters, the zero eigenvalues must be truncated by setting their inverses to zero:

Λ^{-1}_ii = 1/λ_i if λ_i ≠ 0,  and 0 if λ_i = 0.

This method was used in [24] to remove components carrying no information.

Interpretation. As assumed before, the variables r_1, ..., r_n are normally distributed. They may be correlated in a normally distributed way, so the joint probability is

P(r) ∼ exp( −(1/2) r^T · E · r ),   (3.11)

where r_i = ⟨r_i⟩ + ∆r_i and E is the inverse of the covariance matrix of the variables r, also known as the error matrix. Usually in real cases the r's are statistically uncorrelated, so E is a diagonal matrix with elements E_ii = 1/σ_i², and P(r) ∼ exp( −(1/2) Σ_{i=1}^{n} r_i²/σ_i² ), where σ_i² is the inverse of the diagonal element E_ii. Substituting the eigendecomposition (3.10), E = QΛQ^{-1}, into (3.11) gives P(r) ∼ exp( −(1/2) r^T·Q·Λ·Q^{-1}·r ). Defining y = Q^{-1}·r ⟺ r = Q·y gives P(y) ∼ exp( −(1/2) y^T·Λ·y ).
Since Λ is the diagonal matrix of eigenvalues, the y's are independent: P(y) ∼ exp( −(1/2) Σ_{i=1}^{n} λ_i y_i² ), with λ_i = Λ_ii the eigenvalues. If the eigenvalues are sorted decreasingly and k ≤ n is the number of non-zero eigenvalues, then P(y) ∼ exp( −(1/2) Σ_{i=1}^{k} λ_i y_i² ). If one of the eigenvalues λ_i is zero, then the corresponding y_i can be any real number and may be fixed at zero. The remaining y's have average ⟨y_i⟩ = 0, variance ⟨(y_i − ⟨y_i⟩)²⟩ = 1/λ_i, and covariance ⟨(y_i − ⟨y_i⟩)(y_j − ⟨y_j⟩)⟩ = 0, since the y's are independent. The r's are r_m = Σ_{i∈I} Q_mi y_i, a sum over the y's with corresponding non-zero eigenvalues, because the ones with zero eigenvalues were fixed at zero.

Note: in programming languages, functions calculating eigenvalues usually return floating-point numbers close to zero for zero eigenvalues, so comparing eigenvalues to 0 exactly will not work properly. It is known that k = rank(E) parameters are independent, which means that only k eigenvalues are non-zero. In practice, the eigenvalues should be sorted decreasingly and the first k eigenvalues and their eigenvectors taken; the remaining values should be considered to be 0.

3.1.3 Individual prediction uncertainties

The uncertainty on an individual prediction stems from the standard deviation of the estimator r̂_ij of R, found using (2.2). As the expectations and variances of the parameters a_i, b_j are known, the variance of r̂_ij can be derived from the function (2.2). To find the variance of r̂_ij, the expansion of (2.2) around the expectations is written:

r̂_ij = a_i · b_j = Σ_{f=1}^{d} a_i^f b_j^f = Σ_{f=1}^{d} µ_i^f ν_j^f + Σ_{f=1}^{d} ∆a_i^f ν_j^f + Σ_{f=1}^{d} µ_i^f ∆b_j^f + Σ_{f=1}^{d} ∆a_i^f ∆b_j^f,

where µ_i^f, ν_j^f are the expectations of the variables a_i^f, b_j^f respectively, and ∆a_i^f, ∆b_j^f are their deviations from the expectations.
From this expansion the variance of r̂_ij can be derived:

∆r_ij² = ⟨( Σ_{f=1}^{d} a_i^f b_j^f − Σ_{f=1}^{d} µ_i^f ν_j^f )²⟩ = ⟨( Σ_{f=1}^{d} (∆a_i^f ν_j^f + µ_i^f ∆b_j^f + ∆a_i^f ∆b_j^f) )²⟩

∆r_ij² = Σ_{f,g} ν_j^f ν_j^g ⟨∆a_i^f ∆a_i^g⟩ + Σ_{f,g} µ_i^f µ_i^g ⟨∆b_j^f ∆b_j^g⟩ + 2 Σ_{f,g} µ_i^f ν_j^g ⟨∆a_i^g ∆b_j^f⟩ + O(∆³)   (3.12)

The O(∆³) terms are not considered, as they are assumed to be very small. The uncertainties on the individual recommendations are the standard deviations, i.e. the square roots of the variances ∆r_ij².

3.1.4 Estimation of normalization factor

The overall normalization factor σ can be calculated using a goodness-of-fit measure. If we assume a normally distributed population with a standard deviation σ, then the residual sum of squares (RSS), divided by σ², has a chi-squared distribution with N degrees of freedom:

χ² = Σ_ij (r_ij − f(x_ij))²/σ²,   RSS = Σ_ij (r_ij − f(x_ij))²

If f(x) describes the data, then the expected value of the chi-squared is ⟨χ²⟩ = N_dof. If the function has been fitted to the data, then the fact that the parameters have been adjusted to describe the data has to be
accounted for. This leads to N_dof equalling the number of degrees of freedom, which is the difference between the total number of data points and the number of independent parameters:

N_dof = N_data − N_params   (3.13)

If the data uncertainties σ are not known, then the χ²_min of the fit provides an estimate of them. A rough estimate of σ is obtained by setting

χ²/N_dof = 1   (3.14)

From this it follows that

σ² ≈ RSS/N_dof   (3.15)

N_dof must be a positive number: if the number of independent parameters is higher than the total number of data points, the model overfits and there is no point in calculating uncertainties at all. This leads to the requirement for the estimation of measurement uncertainty:

N_data > N_params   (3.16)

3.1.5 The algorithm

The analytical algorithm for calculating uncertainties on individual recommendations works under the assumption that the recommendations are normally distributed. Here, the uncertainty means the "one standard deviation" error, so r_ij ± ∆r_ij implies a confidence level of ≈ 68%.

1. Factorize the matrix of ratings
2. Construct the precision matrix E
3. Find the covariance matrix V = E^{-1}
   (a) Find the eigenvalues and eigenvectors of E
   (b) Calculate rank(E)
   (c) Calculate the inverse of E using the eigenvalue truncation scheme
4. Calculate the uncertainties on individual recommendations
5. Estimate the normalization factor and normalize

3.2 Uncertainties for Factorization Machines

Precision matrix. For the Factorization Machine model the precision matrix is calculated in the same way as for the matrix factorization model; the only difference is that the f(x) used for calculating the derivatives is given by equation (2.7). For convenience of derivation, equation (2.7) may be written as:

r(x) = (1/2) Σ_{f=1}^{d} [ ( Σ_{l=1}^{k} v_l^f x_l )² − Σ_{l=1}^{k} (v_l^f)² x_l² ]

∂r(x)/∂v_s^f = x_s Σ_{j=1}^{k} x_j v_j^f − x_s² v_s^f

The precision matrix is

E_{v_s^f, v_q^g} = Σ_i (∂r/∂v_s^f)(∂r/∂v_q^g) = Σ_i x_i^s x_i^q ( Σ_{l=1}^{k} x_i^l v_l^f − x_i^s v_s^f )( Σ_{l=1}^{k} x_i^l v_l^g − x_i^q v_q^g ),

where the x_i are the feature vectors of the known data points.
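The linearized error propagation used in sections 3.1.3 and 3.2 can be sanity-checked numerically in the simplest case, a d = 1 product r = a·b with uncorrelated parameters, where (3.12) reduces to Var(r) ≈ b²Var(a) + a²Var(b). The numbers below are illustrative:

```python
import numpy as np

# Illustrative check of linear error propagation for r = a*b with independent a, b:
# Var(r) ~ mu_b^2 Var(a) + mu_a^2 Var(b), valid for small relative errors.
rng = np.random.default_rng(4)
mu_a, mu_b, sd_a, sd_b = 2.0, -1.5, 0.05, 0.04
var_lin = mu_b**2 * sd_a**2 + mu_a**2 * sd_b**2       # first-order (linearized) variance

a = rng.normal(mu_a, sd_a, size=200_000)              # Monte Carlo reference
b = rng.normal(mu_b, sd_b, size=200_000)
var_mc = (a * b).var()
rel_diff = abs(var_mc - var_lin) / var_lin
print(f"linearized: {var_lin:.5f}, MC: {var_mc:.5f}") # agree within a few percent
```

The neglected O(∆³) (here exactly σ_a²σ_b²) term is tiny for small relative errors, which is precisely the regime where the analytical algorithm is accurate; for large uncertainties the linearization underestimates the variance, consistent with the comparison to Bootstrapping in chapter 4.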
Covariance matrix. For factor dimension d = 1, E is an invertible matrix, since all parameters are independent, and the covariance matrix can be found simply as V = E^{-1}. With a higher factor dimension there are non-independent parameters and the precision matrix is singular; the zero eigenvalues truncation scheme described in section 3.1.2 can then be used to fix the dependent parameters.

Individual recommendation uncertainties. As the expectations and variances of the parameters v are known, the uncertainties on the individual predictions r_i can be calculated using equation (2.7). First, the first-order Taylor series approximation is written:

T(r(x; v)) = r(x; µ) + Σ_{f=1}^{d} Σ_{l=1}^{k} (v_l^f − µ_l^f) (∂r/∂v_l^f)|_{v=µ} + O(∆²),   (3.17)

where O(∆²) is the remainder term representing all the higher-order terms, which is considered small. The approximation of the variance is

Var(r) ≈ Var(T(r)) + O(∆³)

As all the components in equation (3.17) except the v_l (l ∈ {1, ..., k}) are constants, the variance of r is

Var(r) ≈ Var( Σ_{f=1}^{d} Σ_{l=1}^{k} v_l^f (∂r/∂v_l^f)|_{v=µ} )

This is the variance of a linear combination:

Var( Σ_f Σ_{l=1}^{k} v_l^f a_l^f ) = Σ_{l,f} (a_l^f)² Var(v_l^f) + Σ_{(l,f)≠(m,g)} a_l^f a_m^g Cov(v_l^f, v_m^g),   (3.18)

where a_l^f = (∂r/∂v_l^f)|_{v=µ}.

3.3 Bootstrapping

Bootstrapping is a special case of Monte Carlo simulation used for the specific purpose of obtaining an estimate of the sampling distribution by drawing many samples. Generally, the Monte Carlo (MC) method is a numerical method for solving mathematical problems using the simulation of random variables. It is commonly used in cases where an analytical evaluation of the errors is difficult or impossible. The idea of the method is to build a model of the measurement and run it a large number of times, each with a different random number seed. The width of the distribution of the measured values is then taken as the estimate of the measurement uncertainty. Some properties of MC:

• The MC estimate converges to the true value due to the Law of large numbers.
• The MC estimate is asymptotically normally distributed due to the Central limit theorem.

As we want to know the uncertainties on the predicted ratings r̂_ij, R is first factorized: r_ij ≈ r̂_ij = Σ_{f=1}^{d} a_if^0 b_jf^0, where d is the number of latent factors. R̂ is an estimate of the population, so an estimate of the sampling distribution can be obtained by drawing many samples. To find the uncertainty on each measurement r̂_ij, the following model of the measurement is used:

1. Create a new matrix R': r'_ij = r̂_ij + ε_ij, where ε_ij is a sample drawn from a normal distribution with mean 0 and deviation equal to the normalization factor σ (for how to estimate σ, see section 3.1.4).
2. Factorize R' with d factors: r̂'_ij = Σ_{f=1}^{d} a_if^l b_jf^l.
3. Keep all approximations r̂'_ij.

After repeating the steps above N times, there is a sample of N approximations [r̂'_ij] for every element r_ij (see an example of the sample distribution in figure 3.3). If N is high enough, the standard deviation of the sample [r̂'_ij] represents the true deviation ∆r_ij due to the Law of large numbers. In principle, asymmetric uncertainties can be obtained with Bootstrapping, although here the uncertainty is assumed to be symmetric.

Figure 3.3: Example of the r̂'_ij distribution from Bootstrapping. r̂_ij is a predicted rating and ∆r_ij is the standard deviation of the obtained sample [r̂'_ij]
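The bootstrap loop above can be sketched as follows. For simplicity the model fit is replaced here by a truncated SVD on a fully known matrix (the thesis fits sparse matrices with libFM), so all names, sizes, and the fit itself are illustrative stand-ins:

```python
import numpy as np

# Illustrative bootstrap sketch; truncated SVD stands in for the model fit.
def factorize(R, d):
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :d] * s[:d] @ Vt[:d]              # best rank-d approximation of R

rng = np.random.default_rng(5)
n, m, d, sigma, N = 15, 12, 2, 0.2, 100
R0 = rng.normal(size=(n, d)) @ rng.normal(size=(d, m))   # noise-free rank-d ratings
R_hat = factorize(R0 + rng.normal(0, sigma, size=(n, m)), d)  # initial fit

samples = np.empty((N, n, m))
for l in range(N):                                # repeat steps 1-3, N times
    R_prime = R_hat + rng.normal(0, sigma, size=(n, m))  # step 1: re-noise the estimate
    samples[l] = factorize(R_prime, d)            # steps 2-3: re-fit and keep it

delta_r = samples.std(axis=0)                     # per-entry uncertainty estimate
print(delta_r.shape)                              # one uncertainty per rating
```

The per-entry standard deviation over the N refits plays the role of ∆r_ij; keeping the full sample instead would also give asymmetric intervals, as noted above.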
Chapter 4. Uncertainty analysis using synthetic data

The analytical algorithm described in section 3.1 and the Bootstrapping method described in section 3.3 are compared in different scenarios: various matrix sizes, latent factor dimensions, noise levels and sparsities. Because the analytical algorithm is an approximation based on linearization and Gaussian noise, the error calculations are tested on simulated data.

4.1 Test methodology

Data simulation. First, a factorizable matrix of ratings R0 = A · B^T is initialized, where A ∈ R^{n×d} and B ∈ R^{m×d} are the matrices of latent factors for users and items respectively. The elements of A and B are randomly initialized from a uniform distribution. In real cases the ratings r_ij usually have noise on top, so

r_ij = r_ij^0 + ε_ij,   (4.1)

where r_ij^0 ∈ R0 and ε_ij is a sample drawn from a normal distribution with mean 0 and deviation equal to a chosen σ_data. R is randomly divided into known and unknown elements, which form the training dataset R_known and the test dataset R_unknown respectively. Taking β% as the sparsity of R, (100 − β)% of the values of R are considered known. All parameters for initializing an input rating matrix are given in table 4.1; other parameters used in the tests are given in table 4.2.

Table 4.1: Parameters for initializing the input rating matrix
  n        Number of users
  m        Number of items
  d        Number of latent factor dimensions
  ll, lr   Uniform distribution limits: a_i^f, b_j^f ∼ U(ll, lr)
  σ_data   Standard deviation of the added Gaussian noise: ε_ij ∼ N(0, σ_data)
  β        Rating matrix sparsity

Table 4.2: Other parameters
  Method                   Method used for model training
  Iterations               Number of training iterations
  Initialization std.dev.  Standard deviation of the zero-mean normal distribution for initializing the factors
  Simulations              Number of simulations in the Bootstrapping

Further below, the following input rating matrix parameters are used (table 4.3), unless otherwise stated.
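The data simulation can be sketched with the parameter names of table 4.1; the values used here match the defaults quoted in the text (n = 30, m = 40, d = 2, σ_data = 0.2, β = 50%), while variable names and the NaN masking are illustrative choices:

```python
import numpy as np

# Sketch of the data simulation of section 4.1, parameter names as in table 4.1.
rng = np.random.default_rng(6)
n, m, d = 30, 40, 2
ll, lr = -1.0, 1.0                                 # uniform distribution limits
sigma_data, beta = 0.2, 0.5                        # noise std and sparsity

A = rng.uniform(ll, lr, size=(n, d))               # latent user factors
B = rng.uniform(ll, lr, size=(m, d))               # latent item factors
R0 = A @ B.T                                       # noise-free ratings R0 = A B^T
R = R0 + rng.normal(0, sigma_data, size=(n, m))    # eq. (4.1)

known = rng.random((n, m)) >= beta                 # ~(100 - beta)% of entries known
R_known = np.where(known, R, np.nan)               # train set; NaN marks the test set
print(R_known.shape)
```

R_known is then the input to the factorization, while the held-out noise-free values r_ij^0 serve as ground truth for the pull distributions discussed next.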
Distributions of the ratings r0 for various numbers of latent factor dimensions are shown in figure 4.1.
Table 4.3: Parameter values
  n: 30    m: 40    d: 2    ll, lr: −1, 1    σ_data: 0.2    β: 50%
  Method: ALS    Iterations: 2000    Initialization std.dev.: 0.1    Simulations: 50

Figure 4.1: Distribution of the ratings r0 for various latent factor dimensionalities d. With higher d the range of values expands; for d = 2 the values lie approximately in the range (−2, 2)

Factorization. R_known is the input matrix for the algorithm. The factorization of R_known is made using the same number of factors d as in the initialization of R. In the tests below it is assumed that d is known; in reality, however, d would have to be found. The factorization of R gives predictions r̂_ij = Σ_{f=1}^{d} â_i^f b̂_j^f (figure 4.2). Since the approximations r̂_ij try to eliminate the noise N(0, σ_data) in r_ij, the distribution of (r_ij − r̂_ij)/σ_data for r_ij ∈ R_known should be standard normal (figure 4.3). This is a check that the optimization procedure found the minimum: if the standard deviation of this distribution is much larger than 1, the model training failed and there is no point in finding prediction uncertainties; on the contrary, if the standard deviation is much smaller than 1, the model is overfitted.

Analytical algorithm. If the model trained well, the uncertainties are calculated using the analytical algorithm. The pull distribution of the normalized residuals

(r_ij^0 − r̂_ij)/∆r_ij

is also expected to be standard normal, in that the approximations r̂_ij try to get to r_ij^0. Here r_ij^0 = Σ_{f=1}^{d} a_i^f b_j^f are the values in R_unknown before adding the noise ε_ij in equation (4.1), and ∆r_ij are the uncertainties calculated by the analytical algorithm.
Figure 4.2: Comparison of unknown ratings and their predictions. The cloud of predictions versus ratings without added noise is narrower, meaning ˆr tries to eliminate the noise.

Figure 4.3: Pull distribution of the normalized residuals (r_ij − ˆr_ij)/σ_data, r_ij ∈ R_known. The standard deviation is ≈ 1, which means the optimization procedure found the minimum.

Figure 4.4: Pull distribution of the normalized residuals (r^0_ij − ˆr_ij)/Δr_ij, r_ij ∈ R_unknown. The standard deviation is slightly higher than 1, which means the uncertainties Δr_ij are slightly underestimated.

It is important to note that in this test the overall normalization factor σ is known: it is the standard deviation of the normal noise in equation (4.1). If the noise (fluctuations, measurement uncertainties) is underestimated, the standard deviation of the normalized residuals is higher than it would be with a correct estimate of the uncertainties; if overestimated, it is smaller. Although the pull distribution of the normalized residuals is normally distributed, its standard deviation is slightly higher than 1, meaning the uncertainties on r_ij are underestimated, although only by a few percent (figure 4.4).

Bootstrapping. If the pull distribution of the normalized residuals is close to a standard normal, the uncertainties are also estimated by the Bootstrapping method. If the analytical algorithm is correct, a scatter plot of the two results should be the straight line Δr_CALC = Δr_MC, where Δr_CALC and Δr_MC are the results of the analytical algorithm and the Bootstrapping respectively. A comparison of the uncertainties computed by both methods is shown in figure 4.5. For low uncertainties, the calculation and the bootstrapping agree, which means the analytical solution is directly proportional to the real uncertainties. Higher uncertainties, however, are underestimated by the calculation. This can be explained by the linear approximation used in the Maximum likelihood method and equation (3.12).

Figure 4.5: Comparison of the calculated Δr_ij (r_ij ∈ R_unknown) with the Bootstrapping results. The analytical algorithm underestimates the higher uncertainties.

As expected, the uncertainties for unknown ratings are higher than those for known ratings, because the model parameters are learned on the known ratings (figure 4.6).

Figure 4.6: Histogram of the calculated Δr. The uncertainties for unknown ratings (test set) are higher than for known ratings (training set).

The results shown above are for a single simulated rating matrix and one particular choice of parameters. Further tests with various parameters follow.
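The Bootstrapping idea can be sketched in a few lines: regenerate the noise many times, refit on each noisy copy, and take the per-prediction spread of the refits as Δr_MC. To keep the sketch short, a one-parameter least-squares model stands in for the full matrix factorization; the resampling logic is the same, and all names and sizes are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.1, 1.0, 50)
sigma_data = 0.2
y0 = 1.5 * x                                       # "true" ratings
preds = []
for _ in range(200):                               # the "Simulations" parameter
    y = y0 + rng.normal(0.0, sigma_data, x.size)   # fresh noise per run
    slope = float(x @ y) / float(x @ x)            # least-squares refit
    preds.append(slope * x)
delta_mc = np.std(preds, axis=0)                   # per-prediction uncertainty
print(round(float(delta_mc.max()), 3))
```

Because the fit averages the noise over many points, the resulting uncertainties are well below σ_data, just as the fitted ratings are less noisy than the raw ones.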
4.2 Tests with various rating matrices and noise levels

A comparison of the analytical algorithm and the Bootstrapping results for various matrix sizes and noise levels (figure 4.7) shows that the analytical algorithm converges better to the Bootstrapping result for larger rating matrices. This can be explained by the fact that for larger matrices with the same sparsity there are, on average, more known values for each user and each item. The analytical algorithm's results are more strongly underestimated for higher noise levels, although the algorithm can estimate uncertainties even for relatively high noise (see the ratings distribution for d = 2 in figure 4.1). For the same matrix size and sparsity but a higher number of latent-factor dimensions, the analytical algorithm's result becomes less reliable (figure 4.8). With too many factors for a highly sparse rating matrix, the uncertainties are too high, so it is impossible to make reliable predictions.

Figure 4.7: Comparison of the analytical algorithm and the Bootstrapping results for unknown ratings for various matrix sizes and noise levels σ. From left to right: increasing noise level. From top to bottom: increasing matrix size. The analytical algorithm's results are less underestimated for larger matrices.
Figure 4.8: Comparison of the analytical algorithm and the Bootstrapping results for unknown ratings for various numbers of latent-factor dimensions d and sparsities β. From left to right: increasing d. From top to bottom: increasing β. The analytical algorithm's results become less reliable for higher d and β.

4.3 Tests with various sparsities

Whether the method works on sparse data is tested on the same input matrix. Figures 4.9 and 4.10 show the standard deviations of the pull distributions of the normalized residuals (r − ˆr)/σ_data and (r^0 − ˆr)/Δr respectively, for various sparsities of the rating matrix R. The figures show the result of 20 simulations. From the first figure it may be concluded that for a very low number of known data points the model may overfit. The second figure shows that for a highly sparse input the method cannot estimate the uncertainties for known or unknown ratings properly. With higher sparsity, the prediction uncertainties increase (figure 4.11). This is in line with expectations: the fewer data are known, the higher the prediction uncertainties. For highly sparse rating matrices, the uncertainties on the rating predictions become larger than the predictions themselves, which means the predictions are unreliable (figure 4.12).
Figure 4.9: Standard deviation of the normalized residuals (r − ˆr)/σ_data as a function of sparsity. For high sparsity the model fails to find the unknown ratings.

Figure 4.10: Standard deviation of the normalized residuals (r^0 − ˆr)/Δr as a function of sparsity. For high sparsity of R the method cannot estimate the uncertainties for known or unknown ratings properly.
Figure 4.11: Distribution of the uncertainties as a function of sparsity. With higher sparsity the prediction uncertainties increase: the fewer data are known, the higher the prediction uncertainties. For known ratings, however, the uncertainties increase only slightly.

Figure 4.12: Distribution of the relative uncertainties Δr/|ˆr|. For highly sparse rating matrices, the uncertainties on the rating predictions become larger than the predictions themselves, which means the predictions become unreliable.
4.4 Estimation of the normalization factor

The parameter values used in this test are listed in table 4.4.

Table 4.4: Parameter values
  n                        30
  m                        40
  d                        2
  ll, lr                   −1, 1
  σ_data                   0.2
  β                        0.5
  Method                   ALS
  Iterations               1000
  Initialization std.dev.  0.1

If the uncertainty σ_data is known, the goodness-of-fit given by equation (3.14) can be tested. To get an estimate of χ²/N_dof, a Monte Carlo simulation with 100 runs is executed. There are N_data = n·m·(1 − β) = 600 known data points and N_params = (n + m)·d − d² = 136 independent parameters (eq. (3.9)), so N_dof = N_data − N_params = 464.

Figure 4.13: Histogram of the chi-square per degree of freedom. The estimate of the chi-square per degree of freedom is approximately 1.

In the test, as expected, the average of the distribution is χ²/N_dof ≈ 1 (figure 4.13). If the input noise σ_data is unknown, instead of estimating χ²/N_dof in the MC simulation, the RSS is calculated for each run; σ_data can then be estimated via equation (3.15). Below, an average over 10 estimates is shown. In the test case there are 136 independent parameters, which means at least 137 of the 1200 possible ratings must be known due to the requirement N_data > N_params (3.16); this corresponds to a maximum rating matrix sparsity of 88.6% (figure 4.14).
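The degrees-of-freedom bookkeeping above is simple enough to sketch directly (N_params from eq. (3.9); the function name is our own):

```python
def dof(n, m, d, beta):
    n_data = round(n * m * (1 - beta))   # known data points
    n_params = (n + m) * d - d * d       # independent parameters, eq. (3.9)
    return n_data, n_params, n_data - n_params

print(dof(30, 40, 2, 0.5))   # (600, 136, 464)
```

The same function also reproduces the sparsity bound: with 137 of 1200 ratings known, N_dof is just positive, matching the quoted maximum sparsity of 88.6%.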
Figure 4.14: Maximum sparsity of a rating matrix due to the requirement N_data > N_params

If the number of degrees of freedom is low (not enough data points), the uncertainty estimate is not reliable. For high sparsity (a low number of degrees of freedom) the normalization factor estimate is much higher than the input noise in the data (figure 4.15); hence the uncertainties on the rating predictions are high and the predictions are not reliable. The level of sparsity at which ˆσ_data starts to diverge also depends on the level of input noise in the data: the higher the input noise and the sparsity, the larger the divergence of the estimate ˆσ_data from the actual σ_data. This level differs, however, for different sizes of the rating matrix. For example, the same test on a larger rating matrix with the same number of factors (n = 60, m = 80, d = 2) shows that the method gives a reliable noise level estimate even for a highly sparse rating matrix (figure 4.16). In contrast, the test on a rating matrix initialized with a higher number of factors (n = 30, m = 40, d = 4) shows that the model overfits and, hence, the noise level is underestimated (figure 4.17).

Figure 4.15: Normalization factor estimate for various sparsities (n = 30, m = 40, d = 2). For high sparsity the estimate is much higher than the input noise in the data.

So far it was assumed that the number of factors d_model in the matrix factorization model is known a priori and equals the number of factors d_data used to initialize the input rating matrix. In reality, however, d is an unknown parameter and hence has to be found. A test with a wrongly chosen number of factors, d_model ≠ d_data, shows that ˆσ_data > σ_data for d_model < d_data. As expected, ˆσ_data < σ_data for d_model > d_data due to overfitting, since the RSS used to estimate σ_data (see eq. (3.15)) is calculated only on the known ratings.
Figure 4.16: Normalization factor estimate for various sparsities (n = 60, m = 80, d = 2). The level of sparsity at which the estimate starts to diverge is higher for a larger rating matrix.

Figure 4.17: Normalization factor estimate for various sparsities (n = 30, m = 40, d = 4). The model overfits (too many factors for a small matrix) and, hence, the noise level is underestimated.

Figure 4.18: Normalization factor estimate for a wrongly chosen number of factors: ˆσ_data > σ_data for d_model < d_data, and ˆσ_data < σ_data for d_model > d_data.
4.4.1 Determining d

To determine d, cross-validation is used. Cross-validation is a method for estimating the prediction accuracy of a model or for performing model selection [25]. The prediction accuracy is estimated for models with various d, and the value of d at which the prediction error is minimal is taken as the model's number of factors d_model. The method holds out part of the available data as a validation set and takes the rest as a training set. The model is fit to the training set, and the predictive accuracy is evaluated on the validation set. The split-train-evaluate cycle is repeated multiple times, and the estimated accuracy is obtained by averaging over the rounds.

Depending on how the data is split, there are different types of cross-validation. The most popular one is k-fold cross-validation, where the dataset is partitioned into k mutually exclusive subsets of approximately equal size; at each round one subset is taken as the validation subset and the other k − 1 subsets form the training subset. For a highly sparse input matrix, however, k-fold cross-validation might fail to give a reliable accuracy estimate, since the number of rounds is fixed and equal to k. Random subsampling cross-validation randomly splits the input dataset into training and validation subsets, with the same proportions at each round. Since part of the input data is held out as the validation set, the sparsity of the rating matrix increases to γ + (1 − γ)β, where γ is the proportion of the dataset that is held out. The test shows that with increasing sparsity and an increasing input number of factors, the method more often fails to find the right d_data, because this means fewer data points and more parameters (figure 4.19).

Figure 4.19: The results of 10 random k-fold cross-validations with k = 5. The middle graphs do not include results for d_model = 5 due to the violation of requirement (3.16).
Solid lines show the accuracy assessed on the validation set, dashed lines on the training set.
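The random-subsampling split described above can be sketched as a small helper: at each round a random fraction γ of the known entries is held out as the validation set. The model-fitting step is left abstract, and all names are our own.

```python
import numpy as np

def subsample_split(known_idx, gamma, rng):
    # Randomly hold out a fraction gamma of the known entries.
    known_idx = np.asarray(known_idx)
    n_val = int(round(gamma * len(known_idx)))
    perm = rng.permutation(len(known_idx))
    return known_idx[perm[n_val:]], known_idx[perm[:n_val]]  # train, val

rng = np.random.default_rng(3)
train, val = subsample_split(np.arange(100), gamma=0.2, rng=rng)
print(len(train), len(val))   # 80 20
```

Unlike k-fold cross-validation, this can be repeated for an arbitrary number of rounds, which is the property exploited above for sparse matrices.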
4.5 Conclusions

From the tests, the following statements about the analytical algorithm and the Bootstrapping method can be made.

The Bootstrapping method:
• The Bootstrapping method is slow: it needs many computationally intensive runs to give accurate uncertainties.
• With a sufficient number of runs, the approximation can be considered close to the real uncertainty, due to the properties of the Bootstrapping method listed in section 3.3.

The analytical algorithm:
• The analytical method is fast: the only computationally intensive part is calculating the inverse of the precision matrix. There is no need to run the matrix factorization multiple times as in the Bootstrapping method.
• The analytical method requires the overall normalization factor σ_data in equations (3.5, 3.6, 3.7). One run of the model training gives a good estimate of the normalization factor, provided the training finds a global minimum.
• The analytical algorithm's results are less underestimated for larger matrices.
• The analytical algorithm's results become less reliable for lower N_dof and higher input noise.
• For highly sparse rating matrices, the uncertainties on the rating predictions become larger than the predictions themselves, which means the predictions are unreliable.

The analysis of the goodness-of-fit shows that the method is implemented without programming errors and is stable even for high sparsity. Considering that in real recommendation problems the rating matrix is large and sparse (for instance, the well-known MovieLens 1M dataset has a sparsity of ≈ 95%), and that the analytical algorithm works on simulated large sparse matrices, the algorithm may be useful for estimating noise levels in real data.
Chapter 5. Calculations on real data

5.1 Financial market data

The input dataset consists of requests for quotes (RFQs) on emerging market bonds. A bond is a debt investment in which an investor loans money to an entity (usually corporate or governmental), which borrows the funds for a set period of time at a variable or fixed interest rate. Bonds are issued by companies, municipalities and states to raise money and finance a variety of projects. A specification of the dataset used in the recommender system is provided in table 5.1.

Table 5.1: Input data columns
  Trade date        Date of sending the request for quote
  Customer          The company or government that bought or sold the bond
  Issuer            Bond issuer
  Bond              The name of the bond
  Buy/Sell          Type of trade (buy or sell)
  Price             The market price at the trade date of a tradeable bond, expressed as a percentage of the bond's par value
  Volume            Number of bonds traded in the transaction
  Time to maturity  The date on which the bond will mature, i.e. reach par

Other bond parameters are not considered. For the sake of privacy protection, the data is anonymized. Additionally, for convenience, all customers, bonds and issuers are assigned individual numerical identifiers: natural numbers in ascending order. An overview of the data is given in table 5.2.

Table 5.2: Input data values
  Trade date        from 29/04/2015 to 29/05/2016
  Customer          1400 unique customers
  Issuer            131 unique issuers
  Bond              414 unique bonds
  Buy/Sell          13880 buy RFQs, 13458 sell RFQs
  Price             [51.75%, 172.5%]
  Volume            [10^3, 4.85·10^7]
  Time to maturity  from 5 days to 30 years

The data contains 27338 transactions over one year starting 29/04/2015 (figure 5.1). The blank period around trade day 250 without any trades corresponds to holidays: twelve days from December 23rd till January 3rd inclusive.
Figure 5.1: Transactions per day

There are 1400 customers. The most active customers, both by total traded volume and by number of trades, are the five global inter-dealer brokers. The normalized histogram of the number of transactions per week of these companies shows that they traded less this year than last year (figure 5.2). Customers who trade a lot buy and sell bonds almost equally, while for less active customers the correlation between the numbers of buy and sell transactions is weaker (figure 5.3). 358 customers only buy, 283 customers only sell, and the others both buy and sell.

Figure 5.2: Distribution of transactions per week. The five most active customers are inter-dealer brokers. They traded less this year than last year.

There are 131 issuers with 414 bonds in total in the dataset. About half of the issuers have one bond, and some have more than 20 different bonds (figure 5.4). Not all issuers' bonds were traded in all quarters: some issuers' bonds were not traded in 2016, while others have no transaction history in 2015 (figure 5.5). The time to maturity of the traded bonds ranges from five days to 30 years (figure 5.6); bonds with a time to maturity within 10 years are the most popular. Histograms of the prices and volumes of the trades are presented in figure 5.7.
Figure 5.3: Comparison of customers' buy and sell transactions. Customers who trade a lot buy and sell almost equally, while for less active customers the correlation between the numbers of buy and sell transactions is weaker.

Figure 5.4: Cumulative histogram of issuers' bonds. About half of the issuers have one bond and some have more than 20 different bonds.

Figure 5.5: Periods in which issuers' bonds were traded. Not all issuers' bonds were traded in all quarters.
Figure 5.6: Time to maturity of the transactions. Bonds with a time to maturity within 10 years are the most popular.

Figure 5.7: Price and volume of the trades

5.2 Rating matrix

In matrix factorization, the input data should be represented as a matrix of ratings (r_ui) ∈ R^{n×m}, where r_ui is the rating given by customer u to item i. Items may be bonds or issuers. To obtain a rating matrix dense enough for matrix factorization, temporal aggregation is applied with the following periods: one month, one quarter (three months) and one year (table 5.3). If bonds are taken as items, the rating matrix is too sparse; for this reason, bond issuers are taken as items.

Table 5.3: Size and density of the rating matrix depending on the period
  Period (T)  Time                   Customers  Bonds  Density  Issuers  Density
  1 month     February 2016          425        278    1.16%    81       3.22%
  1 quarter   October-December 2015  827        312    1.65%    88       4.05%
  1 year      May 2015 - April 2016  1400       414    1.98%    131      3.82%

The rating matrix for a period of one year, with issuers ordered by the number of customers who traded their bonds, is shown in figure 5.8. Figures 5.9 and 5.10 show how many customers traded how many unique bonds from how many issuers within the entire period. 445 customers traded only one bond each, and 957
customers traded at most 5 bonds each. 60 bonds were traded by only one customer, and 108 bonds were traded by at most 5 customers.

Figure 5.8: Rating matrix for a period of one year. Issuers are ordered ascendingly by the number of customers who traded their bonds; customers are then sorted by the number of issuers.

Figure 5.9: Cumulative histogram of the number of unique bonds customers traded

Figure 5.10: Cumulative histogram of the number of unique customers per bond

Since the rating matrix is highly sparse, densifying the matrix may increase the accuracy of the predictions. The straightforward approach to densifying a rating matrix is to remove unpopular users and/or items, i.e. those with a low number of known ratings. To make the rating matrix denser, an algorithm is applied which ensures that all users and items have at least l ∈ N entries in the rating matrix (algorithm 1). On each round the algorithm removes the users and items that have fewer than l entries in the rating matrix. As a result, with increasing l the numbers of customers and items in the rating matrix decrease drastically: the rating matrix becomes smaller and denser (tables 5.4, 5.5 and 5.6).

Table 5.4: One month period rating matrix parameters for various entry limits l
  l  Customers  Issuers  Elements  Density, %
  1  425        81       1108      3.22
  2  198        71       873       6.21
  3  118        57       693       10.30
  4  71         47       526       15.76
  5  45         35       380       24.13
  6  21         21       197       44.67
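The entry-limit procedure just described (the thesis's algorithm 1) can be sketched in Python: repeatedly drop users (rows) and items (columns) with fewer than l known ratings until none remain. Encoding unknown ratings as NaN is our own choice for the sketch.

```python
import numpy as np

def densify(R, l):
    R = np.asarray(R, dtype=float)
    while True:
        R = R[(~np.isnan(R)).sum(axis=1) >= l]   # drop sparse users
        keep = (~np.isnan(R)).sum(axis=0) >= l   # find sparse items
        if keep.all():
            return R                             # no item removed: stop
        R = R[:, keep]                           # drop them and repeat

R = np.array([[1.,     1.,     np.nan, np.nan],
              [1.,     1.,     1.,     np.nan],
              [np.nan, 1.,     1.,     np.nan],
              [np.nan, np.nan, np.nan, 1.]])
print(densify(R, 2).shape)   # (3, 3)
```

The loop is needed because removing an item can push a previously valid user below the limit, exactly as in the repeat-until structure of algorithm 1.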
Algorithm 1 Removing users and items without enough known ratings
Data: rating matrix (r_ui) ∈ R^{n×m} {r_ui: rating of user u for item i}, minimum number of entries l ∈ N.
Result: reduced rating matrix.
 1: repeat
 2:   U ← set of users in R
 3:   I ← set of items in R
 4:   for all users u ∈ U do
 5:     if the number of known ratings in R[u,:] < l then
 6:       R ← remove the corresponding row u in R
 7:     end if
 8:   end for
 9:   STOP = true
10:   for all items i ∈ I do
11:     if the number of known ratings in R[:,i] < l then
12:       R ← remove the corresponding column i in R
13:       STOP = false
14:     end if
15:   end for
16: until STOP

Table 5.5: One quarter period rating matrix parameters for various entry limits l
  l  Customers  Issuers  Elements  Density, %
  1  827        88       2951      4.05
  2  449        76       2565      7.52
  3  321        70       2299      10.23
  4  230        65       2012      13.46
  5  165        64       1748      16.55
  6  125        59       1526      20.69

Table 5.6: One year period rating matrix parameters for various entry limits l
  l  Customers  Issuers  Elements  Density, %
  1  1400       131      7014      3.82
  2  910        95       6492      7.51
  3  677        82       6007      10.82
  4  523        80       5539      13.24
  5  436        80       5191      14.88
  6  362        75       4799      17.68

5.3 Implicit feedback

As the data do not represent explicit "preferences" of customers towards bonds, "preferences" must be derived from implicit feedback. Two ways of deriving "preferences" from implicit feedback are applied.

Preference to buy or sell. An item may be both sold and bought by a user. The total amount of the user's trades indicates whether the user "likes" the item or not. The "preference" is chosen as follows: if a
user buys an item more than he sells it, or never sells it, it is assumed that the user "likes" the item; and the other way around for "disliking" it. The following aggregation procedure is applied. First, the amount of a trade is calculated as the product of the volume and the price (in percent) of the trade, with a sign depending on the trade's type:

    a_t = +Volume·Price if "buy", −Volume·Price if "sell".

Then for each pair of user u and item i the total traded amount within the chosen period is calculated, r_ui = Σ_{t∈(u,i)} a_t, and

    p_ui = 1 if r_ui > 0, −1 if r_ui < 0.

The cumulative amounts of buys and sells of the most traded customer-issuer pairs are shown in figure 5.11. In some cases there is a big difference between the amount of buys and the amount of sells, which may suggest a strong preference for a particular type of trading (buy or sell). Yet some customers tend to buy and sell the same issuer's bonds equally (green and dark blue lines).

Figure 5.11: Cumulative amounts of buys and sells of the most traded customer-issuer pairs

Preference to trade an issuer's bonds. One item may be traded by a user multiple times, and this frequency of transactions, regardless of the type of transaction, may be used as an indication of "preference": more frequent trading is a stronger indication that the user likes the item. The "preference" may be calculated as

    p_ui = log(1 + n_ui),

where n_ui is the number of transactions of item i by user u within a period. The logarithm is taken since some user-item pairs have a large number of transactions (see figures 5.12 and 5.13).
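Both implicit-feedback encodings above can be sketched compactly. A trade is represented here as a (customer, issuer, side, volume, price) tuple; all names are our own, and the natural logarithm is assumed for log(1 + n_ui).

```python
import math

def buy_sell_preference(trades):
    # p_ui = +1 if the net signed traded amount is positive, -1 if negative
    totals = {}
    for u, i, side, volume, price in trades:
        amount = volume * price if side == "buy" else -volume * price
        totals[(u, i)] = totals.get((u, i), 0.0) + amount
    return {k: (1 if v > 0 else -1) for k, v in totals.items() if v != 0}

def trade_frequency_preference(trades):
    # p_ui = log(1 + n_ui), with n_ui the transaction count per pair
    counts = {}
    for u, i, *_ in trades:
        counts[(u, i)] = counts.get((u, i), 0) + 1
    return {k: math.log(1 + n) for k, n in counts.items()}

trades = [("c1", "i1", "buy", 100, 1.02),
          ("c1", "i1", "sell", 40, 0.99),
          ("c2", "i1", "sell", 10, 0.95)]
print(buy_sell_preference(trades))
```

The first encoding leads to the classification task of section 5.4.1, the second to the regression task of section 5.4.2.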
Figure 5.12: Number of transactions per user-item pair over all dates

Figure 5.13: Binary logarithm of the number of transactions per user-item pair over all dates

5.4 Factorization

A main problem in matrix factorization is choosing the number of factors d for a given dataset. To determine d, the k-fold cross-validation technique described in section 4.4.1 is used. For each matrix the number of factors d should be chosen small enough not to violate requirement (3.16), i.e. the number of independent parameters should not exceed the number of known data points (section 3.1.2). As cross-validation holds out part of the dataset as a validation subset, S = S_T ∪ S_V, the sparsity of the training subset S_T is higher than the rating matrix sparsity β: β_T = β + (1 − β)/k. Therefore the maximum possible d may be smaller. As the one month and one quarter rating matrices are small and sparse, 20-fold cross-validation is used for them. The value of d at which the prediction error is minimal is taken as the model's number of factors. MCMC is chosen as the optimization method, as it performs better than the other methods (appendix A). The parameters used for the cross-validation are listed in table 5.7.

Table 5.7: Cross-validation parameters
  d                        {1···6}
  Method                   MCMC
  Iterations               1000
  Initialization std.dev.  0.1
  Number of folds, k       10, 20

5.4.1 Preference to buy or sell

Predicting the preference to buy or sell is a binary classification task. The prediction accuracy of the binary classification is the proportion of correctly classified values:

    ACC = (Σ True buy + Σ True sell) / Σ Total population.

The classifier gives as predictions the probability estimates of the positive (buy) class; the usual threshold is 0.5. The results of 10-fold cross-validations validating the classification with a threshold of 0.5 are shown in figure 5.14.
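The accuracy score above can be sketched as a small function: the classifier outputs the probability of the positive ("buy") class, which is thresholded at 0.5; names are our own.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    # ACC = (true buys + true sells) / total population
    preds = [1 if p > threshold else 0 for p in y_prob]
    hits = sum(int(t == p) for t, p in zip(y_true, preds))
    return hits / len(y_true)

print(accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))   # 0.5
```

Varying `threshold` traces out the ROC curve discussed below.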
Limiting the number of entries by applying algorithm 1 to the matrices does not have any effect on the accuracy. Even though the accuracy on the one month
rating matrix is in some cases slightly higher than on the other matrices, the variance of the accuracy estimate is high. The best accuracy with low variance is obtained on the one year period matrix and equals 57%. The sensitivity (true positive rate) and specificity (true negative rate) values suggest that the accuracy might be improved by changing the classification probability threshold. However, the predicted distributions of the two classes overlap strongly (figure 5.15). A receiver operating characteristic (ROC) curve, which illustrates the performance of a binary classifier as its classification probability threshold is varied, shows that adjusting the threshold does not improve the accuracy; the mean area under the curve (AUC) for the validation sets is 0.59. It may be concluded that predicting the buy and sell preferences of users towards items with the matrix factorization model, under the assumption that the amount of a user's trades represents the user's preference to buy or sell, is not reliable.

Figure 5.14: Accuracy score, sensitivity and specificity for the matrices with ratings as preference to buy or sell. The matrices' parameters are listed in tables 5.4, 5.5 and 5.6. Some of the results for a low entry limit l do not include results for high d due to the violation of requirement (3.16). Solid lines show the accuracy assessed on the validation sets, dotted lines on the training sets. Accuracies are shown with a one-standard-deviation interval. Higher accuracy is better.
Figure 5.15: Distribution of the predictions for the buy and sell classes in the validation sets, from 10-fold cross-validation on the T = 1 year, l = 1 matrix with a d = 1 model. For other settings the results are similar.

Figure 5.16: ROC of the validation sets, from 10-fold cross-validation on the T = 1 year, l = 1 matrix with a d = 1 model. For other settings the results are similar.

5.4.2 Preference to trade

Predicting the preference to trade is a regression task. To compare errors across datasets and models, the prediction accuracy is chosen as the normalized MSE (NMSE): the MSE divided by the mean square of the targets,

    NMSE_S = Σ_{r_ij ∈ S} (r_ij − ˆr_ij)² / ‖S‖²₂,    (5.1)

where S is a set of ratings. For the uncleaned one year period matrix (l = 1) and 10-fold cross-validation, the maximum is d_max = 4. The cross-validation gives an optimum of d_opt = 4 (figure 5.17).

Figure 5.17: NMSE for the one year period rating matrix without an entry limit (T = 1 year, l = 1). Dashed lines show a one-standard-deviation interval.

The results of performing 10- and 20-fold cross-validations on the "cleaned" matrices, whose parameters are listed in tables 5.4, 5.5 and 5.6, are shown in figure 5.18. Limiting the number of entries by applying algorithm 1 to the matrices has different effects: in the one month period rating matrix it decreases the accuracy, while in the one quarter and one year period matrices it does not have any considerable effect on the accuracy. The benefit is that it makes it possible to use a higher number of latent-factor dimensions d. The lowest error is achieved for T = 1 month, l = 1, d = 1. For the one quarter period matrix the optimum is d_opt = 3; for the one year period matrix, d_opt = 6. Unfortunately, the comparison of the ratings and their predictions in the validation sets shows that the predictions are quite far off from the real values. The results of 20-fold cross-validation of the d = 1 model on the one month period rating matrix without an entry limit, 20-fold cross-validation of the d = 3 model on the one quarter period matrix, and 10-fold cross-validation of the d = 6 model on the one year period matrix with entry limit l = 3 are shown in figure 5.19. It may be concluded that it is impossible to make reliable predictions with the one month period matrix.

To get the accuracy of the individual predictions, uncertainties are calculated with the analytical algorithm (section 3.1). The algorithm requires point estimates of the parameters. However, the MCMC optimization method samples the parameters from the posterior distribution and gives the final predictions as a mean of all predictions made during the learning process. Therefore, the chosen models are optimized again by SGD with learning rate η = 0.02 and regularization λ = 0.2. To get a test set, cross-validation is used. The pull distribution of the normalized residuals (r_ij − ˆr_ij)/Δr_ij, where r_ij are the values in the validation sets R_unknown and Δr_ij are the uncertainties calculated by the analytical algorithm, is expected to be a standard normal. Values with r_ij = 1 and ˆr_ij = 1 are dropped from the distribution, since the minimum rating in the training set is ≥ 1 and libFM bounds its predictions:

    min_{r_ij ∈ R_train}(r_ij) ≤ ˆr ≤ max_{r_ij ∈ R_train}(r_ij).

This means ˆr_ij = 1 does not express any preference and should be ignored. The pull distributions of the normalized residuals show that the calculated errors are mostly correct.
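For reference, the NMSE of equation (5.1) used throughout this section is straightforward to compute (a small sketch; the function and variable names are our own):

```python
def nmse(targets, predictions):
    # sum of squared errors divided by the sum of squared targets
    num = sum((r - p) ** 2 for r, p in zip(targets, predictions))
    den = sum(r ** 2 for r in targets)
    return num / den

print(nmse([2.0, 1.0], [1.0, 1.0]))   # 0.2
```

A value of 0 means perfect predictions; a value near 1 means the predictions are no better than predicting zero everywhere.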
However, the distributions have a high peak around zero, which means there are many overestimated errors. The distributions are skewed to the right, because more rating predictions are underestimated than overestimated (figure 5.20). The uncertainties on the rating predictions are mostly lower than the predictions for T = 1 quarter; for T = 1 year there is a larger portion of predictions with uncertainties higher than the predictions (figure 5.21).

5.5 Conclusions

Two ways of deriving preferences from the implicit feedback are applied: the preference to buy or sell and the preference to trade an issuer's bonds, leading to a classification and a regression task respectively. Temporal aggregation of the data is applied with periods of one month, one quarter and one year. The rating matrix is densified by removing items and users with a low number of entries. For the regression task, uncertainties are calculated with the analytical algorithm. To validate the models, k-fold cross-validation is used.

Predicting the buy and sell preferences of users towards items with the matrix factorization model, under the assumption that the amount of a user's trades represents the user's preference to buy or sell, appears to be unreliable. Densifying the rating matrix makes it possible to use a higher number of factors; however, it does not affect the prediction accuracy in a positive way, although it removes predictions with high uncertainties. Temporal aggregation of the data with a period of one month results in a small sparse rating matrix on which the factorization does not perform well; the results for the one quarter and one year periods are better. Taking the uncertainties on the individual predictions calculated by the analytical algorithm into consideration possibly improves the recommendations. Even though there are many overestimated errors, the uncertainties on the rating predictions are mostly lower than the predictions themselves.
For the one year period matrix, there is a larger portion of predictions with uncertainties higher than the predictions.
Figure 5.18: NMSE for all "cleaned" matrices with ratings as "preference to trade a bond"; the matrices' parameters are listed in tables 5.4, 5.5 and 5.6. Some of the results for a low entry limit l do not include results for high d due to the violation of requirement (3.16). Solid lines show the accuracy assessed on the validation sets, dotted lines on the training sets.

Figure 5.19: Comparison of the target ratings and their predictions in the validation sets of the cross-validation of the optimal models. Here, the ratings are the binary logarithm of the number of transactions.
Figure 5.20: Pull distribution of the normalized residuals (r_ij − r̂_ij)/∆r_ij, r_ij ∈ R_unknown. The distributions are cropped to the range [−5, 5]. The standard deviation of the normalized residuals for the T = 1 quarter, l = 1 matrix and the d = 3 model (top left) is much higher than 1, because some uncertainties are highly underestimated. The distributions have a peak around zero, which means that many errors are overestimated. The distributions are right-skewed, because more rating predictions are underestimated than overestimated.

Figure 5.21: Distribution of the relative uncertainties ∆r_ij/r̂_ij, r_ij ∈ R_unknown. For T = 1 quarter, the uncertainties on the rating predictions are mostly lower than the predictions. For T = 1 year, a larger portion of the predictions have uncertainties higher than the predictions.
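The pull (normalized residual) shown in figure 5.20 is simple to compute. A small synthetic sketch (illustrative numbers only, not the thesis data): when the uncertainty estimates ∆r are well calibrated, the pull distribution should have a standard deviation close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic true ratings, estimated uncertainties, and predictions whose
# errors are drawn consistently with those uncertainties (calibrated case).
r_true = rng.uniform(0.0, 5.0, size=10_000)
dr = rng.uniform(0.3, 0.6, size=r_true.size)               # uncertainties Δr
r_pred = r_true + dr * rng.normal(0.0, 1.0, r_true.size)   # calibrated errors

pull = (r_true - r_pred) / dr   # normalized residuals, as in figure 5.20
print(round(float(np.std(pull)), 2))
```

A standard deviation well above 1 (as for the top-left panel of figure 5.20) signals underestimated uncertainties; well below 1, overestimated ones.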
Chapter 6. Conclusions and future work

While recommender systems mostly give only an overall accuracy on a test dataset, our approach is to give accuracies on each of the individual predicted preferences. An algorithm for estimating the uncertainty on individual predictions is derived for the basic matrix factorization model and for the more general factorization machine. On synthetic data the algorithm gives reliable estimates of the uncertainties, showing that the algorithm is well understood. It was shown that for sparse input data the uncertainties are high; nevertheless, they can still be estimated for highly sparse, large matrices.

The algorithm is applied to real data: transactions of emerging market bonds requests for quotes (RFQs). A rating matrix for the matrix factorization model is built with various settings. The best results are obtained for ratings defined as the preference to trade an issuer's bonds, with temporal aggregation periods of one quarter and one year. For these matrices the uncertainties are calculated by the analytical algorithm. Taking these uncertainties on individual predictions into consideration may improve the recommendations.

It is shown that the accuracy of the SGD and ALS methods depends strongly on the hyperparameters. In contrast, MCMC determines the hyperparameters automatically and performs significantly better than the other examined methods, both for dense and for sparse matrices. However, MCMC cannot be used in combination with our algorithm, because the algorithm requires point estimates of the parameters to calculate the uncertainties.

Due to time constraints, other ways of building a rating matrix from the dataset have not been checked. Also, the power of the factorization machine to incorporate features has not been exploited; for example, a bond's price and volume might be included in the feature matrix.
In the future, other models, possibly more suitable for financial markets data, might be found, and the algorithm for estimating the uncertainty on individual predictions might be derived for them.
Appendix A. Minimization methods test

The methods' hyperparameters are listed in table A.1.

Table A.1: Hyperparameters

Parameter                Description                                 SGD  SGDA  ALS  MCMC
η                        Learning rate                               +    +     -    -
λ                        Regularization values                       +    -     +    -
Initialization std.dev.  The standard deviation for initialization   +    +     +    +

Input rating matrices for the tests are simulated as described in section 4.1 with the parameter values listed in table A.2.

Table A.2: Parameter values

Parameter                Value
n                        30
m                        40
d                        2
l_l, l_r                 -1, 1
σ_data                   0.2
β                        0.7
β_v                      0.1
Iterations               1000
Initialization std.dev.  0.1

β_v is the proportion of the training dataset held out as a validation subset in the SGDA method.

Learning rate. SGD and SGDA are highly dependent on the learning rate value. Since SGDA can find the regularization values automatically, the effect of the learning rate on the convergence of the optimization is tested for the SGDA method. The test shows that if the learning rate η is too small (η = 0.0001, 0.001), the algorithm converges slowly; if η is high (η = 0.2, 0.5), it may not find any minimum (figures A.1, A.2). For the further tests the learning rate is set to η = 0.01.
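The sensitivity to the learning rate can be reproduced with a minimal SGD factorizer. This is an illustrative sketch, not the thesis implementation (which uses libFM); the hyperparameter names follow table A.1, and the matrix sizes follow table A.2. A tiny η barely moves the factors in a fixed budget of epochs, while a moderate η converges:

```python
import numpy as np

def sgd_factorize(R, d=2, eta=0.01, lam=0.05, iters=200, seed=0):
    """Plain SGD for R ≈ U @ V.T over the observed (non-NaN) entries;
    returns the train RMSE after `iters` epochs."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(0.0, 0.1, (n, d))   # initialization std.dev. 0.1 (table A.2)
    V = rng.normal(0.0, 0.1, (m, d))
    obs = np.argwhere(~np.isnan(R))
    for _ in range(iters):
        for i, j in obs[rng.permutation(len(obs))]:
            e = R[i, j] - U[i] @ V[j]               # residual on one entry
            U[i] += eta * (e * V[j] - lam * U[i])   # regularized SGD steps
            V[j] += eta * (e * U[i] - lam * V[j])
    return float(np.sqrt(np.nanmean((R - U @ V.T) ** 2)))

# rank-2 toy matrix (n = 30, m = 40) with ~30% of entries hidden
rng = np.random.default_rng(1)
R = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40))
R[rng.random(R.shape) < 0.3] = np.nan

rmse_slow = sgd_factorize(R, eta=0.0001)   # too small: barely converges
rmse_good = sgd_factorize(R, eta=0.01)     # moderate: converges
print(round(rmse_slow, 3), round(rmse_good, 3))
```

With a large η (0.2, 0.5) the same loop can overshoot and diverge, which matches the behaviour reported in figures A.1 and A.2.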
Figure A.1: Convergence of SGDA with various learning rates. Results of model training on 10 simulated rating matrices of sparsity β = 0.7 (RMSE is calculated on the training dataset). SGDA depends strongly on the learning rate η: if η is too small (η = 0.0001, 0.001), the algorithm converges slowly; if η is high (η = 0.2, 0.5), it may not find any minimum.
Figure A.2: Convergence of SGDA with various learning rates. Results of model training on 10 simulated rating matrices of sparsity β = 0.7 (RMSE is calculated on the test dataset). SGDA depends strongly on the learning rate η: if η is too small (η = 0.0001, 0.001), the algorithm converges slowly; if η is high (η = 0.2, 0.5), it may not find any minimum.
Regularization values. The goal of regularization is to generalize the model, enabling it not just to fit the known ratings but to predict unknown ratings as well. The regularization values are typically found by grid search. Since every parameter may have its own regularization value, the grid search has exponential complexity. In the test, to reduce the complexity, the number of regularization parameters is reduced to one, i.e. all parameters share a single regularization value. The measure of accuracy is the root-mean-square error between the predicted ratings r̂ and the actual ratings without noise on top, r⁰, in a test dataset, RMSE_{R⁰_test}:

RMSE_S = √( (1/|S|) ∑_{r_ij ∈ S} (r_ij − r̂_ij)² )    (A.1)

There is no significant difference in the accuracy of the methods on a dense matrix (β = 0.5), yet a high regularization value prevents SGD from finding the optimum and makes ALS more stable. On a rating matrix of high sparsity (β = 0.8), MCMC performs significantly better than the other methods; ALS and SGD without regularization give the worst accuracy, yet using regularization improves it. SGDA (SGD with adaptive regularization) performs as well as SGD with a properly chosen regularization value. This makes SGDA preferable to SGD, because it does not require a grid search. However, SGDA holds out part of the training dataset as a validation subset, which may make SGDA unusable for a highly sparse input. The accuracies on the training and test datasets for a sparse matrix (β = 0.8) are presented in table A.3. The accuracy RMSE_{R⁰_test} is shown in figure A.3. Examples of the methods' results are given in figures A.4, A.5. MCMC performs significantly better than the other examined methods, both for dense and for sparse matrices.
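Formula (A.1) is straightforward to evaluate; a small sketch (array names are illustrative):

```python
import numpy as np

def rmse(r, r_hat):
    """Root-mean-square error over a set S of ratings, eq. (A.1)."""
    r, r_hat = np.asarray(r, float), np.asarray(r_hat, float)
    return float(np.sqrt(np.mean((r - r_hat) ** 2)))

# toy check: per-entry errors (0, 0, 2) give sqrt(4/3)
print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))  # → 1.1547
```

For RMSE_{R⁰_test}, the first argument would be the noise-free ratings r⁰ restricted to the test set, and the second the corresponding predictions r̂.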