Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks

Ask Me Any Rating: A Content-based
Recommender System based on
Recurrent Neural Networks
7th Italian Information Retrieval Workshop
Venezia (Italy), May 30-31 2016
Cataldo Musto, Claudio Greco, Alessandro Suglia and Giovanni Semeraro
Work supported by the IBM Faculty Award ”Deep Learning to boost Cognitive Question Answering”
Titan X GPU used for this research donated by the NVIDIA Corporation
1

Overview
1. Background
Content-based recommender systems
Neural network models
2. Research work
Ask me Any Rating (AMAR)
Experimental evaluation
3. Conclusions
Lesson-learnt
Vision
2

Content-based recommender systems
Consists in matching up the attributes of a user proﬁle with
the attributes of a content object (item) [1]
[1] P. Lops, M. De Gemmis, and G. Semeraro. “Content-based recommender systems:
State of the art and trends”. In: Recommender systems handbook. Springer, 2011
3

Deep learning
Deﬁnition
Allows computational models that are composed of
multiple processing layers to learn representations of data
with multiple levels of abstraction [2]
• Discovers intricate structure in large data sets by using the
backpropagation algorithm [3];
• Leads to progressively more abstract features at higher layers of
representations;
• More abstract concepts are generally invariant to most local
changes of the input;
[2] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning”. In: Nature 521 (2015)
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning representations by
back-propagating errors”. In: Cognitive modeling (1988)
4

Recurrent Neural Networks
• Recurrent Neural Networks (RNN) are architectures suitable to
model variable-length sequential data [4];
• The connections between their units may contain loops which
let them consider past states in the learning process;
• Their roots are in the Dynamical System Theory in which the
following relation is true:
s(t)
= f(s(t−1)
; x(t)
; θ)
where s(t)
represents the current system state computed by a
generic function f evaluated on the previous state s(t−1)
, x(t)
represents the current input and θ are the network parameters.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations
by error propagation. Tech. rep. DTIC Document, 1985
5

RNN pros and cons
Pros
• Appropriate to represent sequential data;
• A versatile framework which can be applied to different tasks;
• Can learn short-term and long-term temporal dependencies.
Cons
• Vanishing/exploding gradient problem [5];
• Difficulties to reach satisfying minima during the optimization of
the loss function;
• Difficult to parallelize the training process.
[5] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies with
gradient descent is difficult”. In: Neural Networks, IEEE Transactions on 5 (1994)
6

Long Short Term Memory (LSTM)
• A speciﬁc RNN introduced to solve the vanishing/exploding
gradient problem;
• Each cell presents a complex structure which is more powerful
than simple RNN cells.
Figure: LSTM architecture [6]
forget gate (f) considers the
current input and the previous
state to remove or preserve the
most appropriate information for
the given task
[6] A. Graves, A. Mohamed, and G. Hinton. “Speech recognition with deep recurrent
neural networks”. In: Acoustics, Speech and Signal Processing (ICASSP), IEEE 2013
7

gradient problem;
input gate (i) considers the
current input and the previous
state to determine how the input
information will be used to
update the state cell
7

gradient problem;
output gate (o) considers the
current input, the previous state
and the updated state cell to
generate an appropriate output
for the given task
7

Ask Me Any Rating (AMAR)
“Mirror, mirror, here I stand.
What is the fairest movie in the
land?”
• Inspired by a neural network model
used to solve Question Answering
toy tasks [7];
• Name adapted from “Ask Me
Anything” [8];
• Very simple Factoid Question
Answering system where user
proﬁles are questions and ratings
are answers.
[7] J. Weston et al. “Towards AI-Complete Question Answering: A Set of Prerequisite
Toy Tasks”. In: CoRR abs/1502.05698 (2015)
[8] A. Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural
Language Processing”. In: CoRR abs/1506.07285 (2015)
8

• Two different modules to
generate:
• User embedding
• Item embedding
• User embedding associated
to a user identiﬁer;
• Item embedding generated
from an item description;
• Concatenation of user and
item embeddings given to a
logistic regression layer to
predict the probability of a
“like”.
w1 w2 wm
User u Item description id
User LT Word LT
v(u)
v(id)
LSTM LSTM LSTM
v(w1) v(w2) v(wm)
h(w1) h(w2) h(wm)
Mean pooling layer
Concatenation layer
Logistic regression layer
9

User embedding
• An identifier u is associated to each user;
• The identifier is given as input to a lookup table (User LT);
• User LT converts it to a learnt user embedding v(u).
Item embedding
• Each word w1 . . . wm of the item description id is associated to a
unique identifier specific of the item descriptions corpus;
• Words identifiers are given as input to a lookup table (Item LT);
• Item LT converts them to learnt words embeddings v(wk);
• Words embeddings v(wk) are sequentially passed through an
RNN with LSTM cells (LSTM module);
• The LSTM module generates a latent representation h(wk) for
each word;
• A mean pooling layer averages the words representations
generating an item embedding v(id) for the item i.
10

“Like” probability estimation
• Item and user embeddings, v(id) and v(u), are concatenated in a
single representation;
• The resulting representation is used as feature for the
prediction task;
• A Logistic regression layer is used to estimate the probability of
a “like” given by user u to a speciﬁc item i;
• The generated score is used to build a sorted list of
recommended items for user u.
Optimization criterion
• The neural network is trained minimizing the Binary
Cross-entropy loss function.
11

AMAR extended
• AMAR extended adds to the AMAR
architecture an additional module
for items genres;
• An identiﬁer gk is associated to each
item genre;
• Genres identiﬁer are given as input
to a lookup table (Genre LT);
• Genres LT converts them to learnt
genres embeddings v(gk);
• A mean pooling layer averages the
genres representations generating a
genres embedding v(ig).
g1 g2 gn
Item genres igj
Genre LT
v(u) v(id)
Mean pooling layer
Concatenation layer
Logistic regression layer
v(g1) v(g2) v(gn)
v(ig)
Item description idUser u
12

Experimental protocol
• Datasets: Movielens 1M (ML1M) e DBbook;
• Text preprocessing: tokenization and stopword removal;
• Evaluation strategy: 5-fold cross validation for Movielens 1M,
holdout for DBbook;
• Recommendation task: top-N recommendation leveraging
binary user feedback;
• Evaluation strategy for recommendation: TestRatings [9];
• Metric: F1-measure evaluated at 5, 10 and 15.
[9] A. Bellogin, P. Castells, and I. Cantador. “Precision-oriented evaluation of
recommender systems: an algorithmic comparison”. In: Proceedings of the ﬁfth ACM
conference on Recommender systems. 2011
13

ML1M
A ﬁlm dataset created by the research group GroupLens of the
University of Minnesota which contains user ratings on a 5-stars
scale.
Each rating has been binarized according to the following formula:
bin_rating(r) =
{
1, if r ≥ 4
0, otherwise
#ratings 1000209
#users 6040
#item 3301
avg ratings per user 31.423
avg positive ratings per user 17.985
avg negative ratings per user 13.439
sparsity 0.95
14

DBbook
A book dataset released for the Linked open data-enabled
recommender systems: ESWC 2014 challenge [10].
It contains binary user preferences (e.g., I like it, I don’t like it).
#ratings 72371
#users 6181
#item 8170
avg ratings per user 11.392
avg positive ratings per user 6.727
avg negative ratings per user 4.665
sparsity 0.998
[10] T. Di Noia, I. Cantador, and V. C. Ostuni. “Linked open data-enabled
recommender systems: ESWC 2014 challenge on book recommendation”.
In: Semantic Web Evaluation Challenge. Springer, 2014
15

Models conﬁgurations
Embedding-based recommenders
W2V Google News (W2V-news)
• Method: SG
• Embedding size: 300
• Corpus: Google News
GloVe
• Embedding size: 300
• Corpus: Wikipedia 2014 +
Gigaword 5
Baseline recommenders
Item to item CF (I2I) *
• Neighbours: 30, 50, 80
User to user CF (U2U) *
• Neighbours: 30, 50, 80
SLIM with BPR-Opt (BPRSlim) *
TF-IDF
Bayesian Personalized Ranking
Matrix Factorization (BPRMF) *
• Latent factors: 10, 30, 50
Weighted Matrix Factorization
Method (WRMF) *
• Latent factors: 10, 30, 50
* MyMediaLite implementations 16

Models conﬁgurations
AMAR
• Opt. method: RMSprop [11]
• α: 0.9
• Learning rate: 0.001
• Epochs: 25;
• User embedding size: 10;
• Item embedding size: 10;
• LSTM output size: 10;
• Batch size:
• ML1M: 1536
• DBbook: 512
AMAR extended
• Opt. method: RMSprop
• α: 0.9
• Learning rate: 0.001
• Epochs: 25;
• User embedding size: 10;
• Item embedding size: 10;
• Genre embedding size: 10;
• LSTM output size: 10;
• Batch size:
• ML1M: 1536
• DBbook: 512
[11] T. Tieleman and G. E. Hinton. “rmsprop”. In: COURSERA: Neural Networks for
Machine Learning Lecture 6.5 (2012) 17

DBbook results
0.662 0.662
0.655
0.656
0.64
0.639
0.631
0.636
0.632
0.662
0.62
0.62
0.63
0.63
0.64
0.64
0.65
0.65
0.66
0.66
0.67
AMAR AMAR
extended
GloVe W2V-News I2I-30 U2U-30 BPRMF-30 WRMF-50 BPRSlim TF-IDF
F1@10
RECOMMENDER CONFIGURATIONS
Differences statistically signiﬁcant according to Wilcoxon test (ρ ≤ 0.05)
18

ML1M results
0.641 0.644
0.575
0.587
0.527 0.525 0.524 0.525
0.548
0.59
0.40
0.45
0.50
0.55
0.60
0.65
AMAR AMAR extended GloVe W2V-News I2I-30 U2U-30 BPRMF-30 WRMF-50 BPRSlim TF-IDF
F1@10
RECOMMENDER CONFIGURATIONS
Only differences between U2U and GloVe, BPRSlim and GloVe, GloVe and Word2vec
are not statistically signiﬁcant according to Wilcoxon test (ρ ≤ 0.05)
19

AMAR pros and cons
Pros
• High improvement on ML1M;
• Able to learn more suitable item and user representations for
the recommendation task;
• Item and user embeddings are not generated using a simple
mean, but they are adapted during training.
Cons
• It does not deal well with very sparse datasets:
• Small improvement on DBbook
• High training times:
• DBbook: 50 minutes per epoch
• ML1M: 90 minutes per epoch
20

AMAR Improvements
Optimization
• Use alternative training methods and regularization techniques;
• Use pretrained word embeddings;
• More appropriate cost functions for top-N recommendation;
• Increase embedding dimensions.
Architecture
• Item modeling may be improved by using different neural
network architectures;
• Classiﬁcation step may be done by using deeper fully connected
layers.
Additional features
Leverage important data silos to enrich item representations:
• Linked Open Data;
• Web and social media. 21

Thanks for your attention
• Design of recommender systems
using deep neural networks;
• Experimental evaluation on
well-known datasets on the
top-N recommendation task;
• Higher performance using deep
models than using shallow
models.
Alessandro Suglia
alessandro.suglia@gmail.com
Claudio Greco
claudiogaetanogreco@gmail.com
22

Technical details
(Warning: for geeks only)

Cross entropy
Deﬁnition
Given two probability distributions over the same underlying set of
events, p and q, it measures the average number of bits needed to
identify an event drawn from a set of possibilities, if a coding
scheme is used based on an “unnatural” probability distribution q,
rather than the “true” distribution p.
Given discrete probability distributions p and q, the cross entropy is
deﬁned as follows:
H(p, q) = −
∑
x
p(x) log q(x)
23

RNN
Given an input vector x(t)
, bias vectors b, c and weight matrices U, V
and W, a forward step of an RNN neural network is computed in this
way:
at = b + Wst−1 + Uxt
st = tanh(at)
ot = c + Vst
pt = softmax(ot)
In this case, the activation function are the hyperbolic tangent
(tanh) for the hidden layer and the multinomial logistic function
(softmax) for the output layer.
24

LSTM
The information ﬂow in an LSTM module is much more complex than
the one in an RNN. The architecture used in this work uses the
following equations, presented in [6]:
it = σ(Wxixt + Whiht−1 + Wcict−1 + bi)
ft = σ(Wxfxt + Whfht−1 + Wcfct−1 + bf)
ct = ftct−1 + it tanh(Wxcxt + Whcht−1 + bc)
ot = σ(Wxoxt + Whoht−1 + Wcoct + bo)
ht = ot tanh(ct)
where σ is the logistic sigmoid function, and i, f, o and c are
respectively the input gate, forget gate,output gate and cell
activation vectors, all of which are the same size as the hidden
vector h.
25

Corpus stats
Google News
• # tokens: 6B
• Vocabulary size: 40K
• # matched words:
• DBbook: 44636 (41.52%)
• ML1M: 35150 (49.13%)
GloVe
• # tokens: 100B
• Vocabulary size: 3M
• # matched words:
• DBbook: 65013 (60.48%)
• ML1M: 49893 (69.74%)
26

[1] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro.
“Content-based recommender systems: State of the art and
trends”. In: Recommender systems handbook. Springer, 2011.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep
learning”. In: Nature 521 (2015).
[3] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
“Learning representations by back-propagating errors”. In:
Cognitive modeling 5 (1988).
[4] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
Learning internal representations by error propagation.
Tech. rep. DTIC Document, 1985.
[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. “Learning
long-term dependencies with gradient descent is difﬁcult”. In:
Neural Networks, IEEE Transactions on 5 (1994).
[6] Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton.
“Speech recognition with deep recurrent neural networks”. In:
26

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE
International Conference on. IEEE. 2013.
[7] Jason Weston et al. “Towards AI-Complete Question Answering:
A Set of Prerequisite Toy Tasks”. In: CoRR abs/1502.05698 (2015).
[8] Ankit Kumar et al. “Ask Me Anything: Dynamic Memory
Networks for Natural Language Processing”. In: CoRR
abs/1506.07285 (2015).
[9] Alejandro Bellogin, Pablo Castells, and Ivan Cantador.
“Precision-oriented evaluation of recommender systems: an
algorithmic comparison”. In: Proceedings of the ﬁfth ACM
conference on Recommender systems. 2011.
[10] Tommaso Di Noia, Iván Cantador, and Vito Claudio Ostuni.
“Linked open data-enabled recommender systems: ESWC 2014
challenge on book recommendation”. In: Semantic Web
Evaluation Challenge. Springer, 2014.
26

[11] Tijmen Tieleman and Geoffrey E. Hinton. “rmsprop”. In:
COURSERA: Neural Networks for Machine Learning Lecture 6.5
(2012).
26

Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (13)

Similar to Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks

Similar to Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks (20)

Recently uploaded

Recently uploaded (20)

Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks