Marvin_Capstone

Like2Vec: A Graph Embedding Technique for
Recommender Systems
Marvin Bertin – marvin.bertin@gmail.com
Michael Ulin - michaelulin@gmail.com
Mike Tamir – mntamir@gmail.com
Jacob Baumbach - jwbaum91@gmail.com
David Ott – davidott4@gmail.com
Data Science Master Program
GalvanizeU
San Francisco, USA
Abstract— User behavior datasets, such as the Netflix dataset, are
typically sparse and high dimensional. For this reason,
recommender systems tend to be based on matrix factorization
compression algorithms, instead of traditional statistical models.
Like2Vec proposes a novel approach to transform a sparse dataset
into a graph representation, followed by a neural embedding of the
nodes into a rich, dense latent feature space. The distance between
movie vectors in this space can be used to compute a similarity metric
between movies. These vector projections allow the model to
surface and recommend highly relevant movies for the respective
users. We show that Like2Vec outperforms standard baselines on
both the RMSE and Recall-at-N evaluation metrics. Different
evaluation metrics lead to different hyper-parameter configurations.
We argue Recall-at-N is a superior metric for evaluating
recommender systems, since it provides a better assessment of the
quality of top-N recommendations.
Keywords—recommender system; neural networks; graphical
model; neural embedding; log-likelihood ratio; latent space
I. INTRODUCTION
Statistical models are among the most popular
techniques in machine learning for supervised predictive tasks.
Logistic regression and linear regression are still among the
most commonly used algorithms today. They have proven to
be robust and powerful models across many domains.
However, there is a type of data on which such
models regularly under-perform. When the data is high
dimensional and sparse, machine learning models suffer from
what is often referred to as the curse of dimensionality.
When dimensionality increases, the volume of the
space increases so fast that the available data become sparse.
This sparsity is problematic for any method that requires
statistical significance. In order to obtain a statistically sound
and reliable result, the amount of data needed to support the
result often grows exponentially with the dimensionality.
Figure 1: Input graph on the left is embedded into continuous
vector space on the right.
Moreover, distances between data points lose their
significance, since all objects appear to be sparse and
dissimilar in many ways, which hinders statistical models
from learning meaningful patterns.
Unstructured text is a great example where statistical
models struggle. A text corpus can be represented as a
document-term frequency matrix. Such a matrix typically has
dimensions on the order of 10^5 to 10^6 and is highly sparse.
Another example is user behavior data, where both
the number of users and items can be very large, but any given
user will only ever interact with a tiny fraction of the item
set. Amazon purchase history and Netflix viewing
history are examples of such data.
There are many ways to deal with such data. One is
to compress it into a lower dimensional, rich feature
space, where the drawbacks of the curse of dimensionality are
mitigated, allowing traditional statistical models to function
adequately.
In text, a very popular compression model is
Word2Vec, a neural network word embedding algorithm. It is
a type of neural language model that has been used to capture
the semantic and syntactic structure of human language [3],
and even logical analogies [4].
In the context of user-item matrices, a common
compression algorithm is matrix factorization. There exist
many different types of matrix factorization algorithms, such as
PCA, SVD and non-negative matrix factorization.
These techniques decompose a high dimensional
matrix into its lower dimensional components, while
maintaining the underlying signal in the data. This
transformed space is sometimes referred to as the latent topic
space. This new space contains rich dense features that were
latent (hidden) in the original feature space.
Like2Vec takes inspiration from these compression
methods and combines them with a graph representation to
build a new kind of recommender system, evaluated on the
well-known Netflix dataset.
II. RECOMMENDER ALGORITHMS
Figure 2: Matrix Factorization
A. COLLABORATIVE ALGORITHMS
Most recommender systems are based on collaborative
filtering, where recommendations rely only on past user
behavior. In the Netflix dataset, used in our evaluation, such
information is in the form of rated user viewing history. There
are two primary approaches to collaborative filtering: the
neighborhood approach and the latent factor approach.
Neighborhood models represent the most common
approach. They are based on the similarity among either users
or items. For instance, two users are similar because they have
rated the same set of items similarly. Similarity between items
can also be deduced with the same logic.
Latent factor models represent users and items as vectors
in the same ‘latent factor’ space by means of a reduced
number of hidden factors. In such a space, users and items are
directly comparable: the rating of user u on item i is predicted
by the proximity, for example inner-product, between the
related latent factor vectors.
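As a minimal illustration of the inner-product case, the predicted rating can be written as below. The symbols p_u and q_i (the latent vectors of user u and item i) and f (the number of latent factors) are notation assumed here for illustration and do not appear in the original text.

```latex
\hat{r}_{ui} = p_u^{\top} q_i = \sum_{k=1}^{f} p_{uk}\, q_{ik}
```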
One of the unique characteristics of the Like2Vec model is
that it incorporates elements of both approaches. The latent
factors are encoded in the dimensions of the dense vector
embedding and the neighborhood information is a by-product
of the SkipGram movie representation.
III. GRAPH EMBEDDING TECHNIQUE
The sparsity of a graph representation is both a strength
and a weakness. Sparsity enables the design of efficient
discrete algorithms, but can make it harder to generalize in
statistical learning. Machine learning applications with graphs,
such as movie recommendation here [6], must be able
to deal with this sparsity.
Figure 3: DeepWalk
A. DEEPWALK
DeepWalk is an approach for learning latent
representations of vertices in a graph. These latent
representations extract meaningful structural information
and encode it in a continuous vector space, which is then
easily exploited by statistical models. DeepWalk can be
viewed as a generalization of the Word2Vec embedding
representation.
Word2vec language models are shallow neural networks that,
much like an auto-encoder, compress a high dimensional sparse
representation of text into dense low dimensional vectors.
From just sequences of words in a corpus, training generates
unsupervised features through back-propagation in the neural
network.
DeepWalk takes this a step further than word2vec and
generates its own “corpus”. The graph is explored by a series
of truncated random walks. By treating the walks as the
equivalent of sentences and the nodes as words, the algorithm is
able to generate arbitrarily sized corpora. This synthetic
language captures the community information present in the
graph. Traditional neural language models can then be used to
extract rich movie embeddings from the corpus.
DeepWalk’s latent representations have been evaluated on
several multi-label network classification tasks for social
networks such as BlogCatalog, Flickr, and YouTube. Results
show that DeepWalk outperforms challenging baselines,
especially in the presence of missing information. DeepWalk’s
representations can provide F1 scores up to 10% higher than
competing methods when labeled data is sparse. In some
experiments, DeepWalk’s representations are able to
outperform all baseline methods while using 60% less training
data.

DeepWalk is also scalable. It is an online learning
algorithm which builds useful incremental results, and is
trivially parallelizable. These qualities make it suitable for real
world tasks such as the Netflix dataset, which is a challenging
high dimensional sparse dataset.
IV. LANGUAGE MODEL
Figure 4: SkipGram Language Model
The goal of language modeling is to estimate the
likelihood of a specific sequence of words appearing in a
corpus. More formally, given a sequence of words
$(w_0, w_1, \ldots, w_n)$, where $w_i \in V$ (V is the vocabulary), we would like to
maximize:
$$\Pr(w_n \mid w_0, w_1, \ldots, w_{n-1})$$
over all the training corpus.
DeepWalk generalizes such language modeling by
exploring the graph through a stream of short random walks.
These walks can be thought of as short sentences and phrases in
a special language. The direct analog is to estimate the
likelihood of observing vertex v_i given all the previous
vertices visited so far in the random walk.
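Written out, the vertex analog of the word-level objective takes the form below. The indexing v_1, ..., v_{i-1} for the previously visited vertices is notation assumed here for illustration, following the usual DeepWalk formulation.

```latex
\Pr\left(v_i \mid v_1, v_2, \ldots, v_{i-1}\right)
```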
A. SKIPGRAM
The language model used in Like2Vec is the SkipGram
algorithm. It maximizes the co-occurrence probability among
the words that appear within a window, w, in a sentence [7].
The SkipGram language model has the following
characteristics:
• Instead of using the context to predict a missing
word, it uses one word to predict the context.
• The context is composed of the words appearing
to the right side of the given word as well as the left
side.
• It removes the ordering constraint on the problem.
• The model is required to maximize the
probability of any word appearing in the context
without the knowledge of its offset from the given
word.
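The characteristics listed above can be illustrated by the training pairs SkipGram consumes. The following is a minimal sketch in plain Python; the function name `skipgram_pairs` and the toy walk are hypothetical and are not taken from the Like2Vec implementation.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # The offset (j - i) is discarded on purpose: the model only
                # learns that the two tokens co-occur, not in which order.
                pairs.append((center, sentence[j]))
    return pairs


# Toy "sentence": a random walk over four hypothetical movie ids.
walk = ["m1", "m2", "m3", "m4"]
pairs = skipgram_pairs(walk, window=1)
```

Each center token is paired with its neighbors on both sides, so one word is used to predict its context rather than the reverse.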
By generating synthetic sentences from a stream of
random walks and using the corpus as input to the SkipGram
language model, we are able to build representations that
capture the shared similarities in local graph structure between
vertices. Vertices which have similar neighborhoods will
acquire similar representations, allowing generalization on
machine learning tasks.
V. EVALUATION METRICS
The goal of the recommender system is to surface a
list of items which are the most relevant or appealing to a
specific user. This is referred to as a top-N recommendation
task. A common practice in industry and academia is to
evaluate recommender systems’ performance through error
metrics such as the RMSE (root mean squared error), which
captures the average error between the actual ratings and the
ratings predicted by the model. However, such an evaluation
metric is not a natural fit for evaluating a top-N
recommendation task.
An extensive evaluation of several state-of-the-art
recommender algorithms suggests that algorithms optimized
for minimizing RMSE do not necessarily perform as expected
on the top-N task, and that lower RMSE often does not
translate into accuracy improvements.
Direct evaluation of top-N performance must be
accomplished by means of alternative methodologies based on
accuracy metrics, such as recall and precision. Accordingly,
this experiment will evaluate the Like2Vec model with both
the classical RMSE metric and the more appropriate method
called Recall-at-N.
A. RECALL-AT-N
The Recall-at-N evaluation metric attempts to directly assess
the quality of top-N recommendations, in a way that RMSE
cannot.
The dataset with known ratings is first split into two
subsets: training set M and test set T. The model is trained
with the ratings in M and then evaluated on the test set T. A
special characteristic of the test set is that it contains only
5-star ratings. The goal is to construct a test set that only
contains highly relevant items for the respective users.

In order to measure recall and precision, we perform the
following steps for each movie i rated 5-stars by user u in T:
i. From user u's viewing history, surface the movie that
is most similar to movie i, based on the cosine
similarity of the embedded vectors.
ii. Randomly select 1000 additional movies unrated by
user u. The assumption is that most of these items
will not be of interest to user u.
iii. Compute the same similarity score for the additional
1000 movies.
iv. Generate a ranked list by ordering all the 1001
movies according to their similarity scores. Let p
denote the rank of the test movie i within this list.
The best result corresponds to the case where the test
movie i precedes all the random items (i.e., p = 1).
v. We form a top-N recommendation list by picking the
N top ranked items from the list. If p ≤ N we have a
hit (i.e., the test item i is recommended to the user).
Otherwise we have a miss. The chance of a hit increases
with N. When N = 1001 we always have a hit.
The computation of recall and precision proceeds as
follows. The overall recall and precision are defined by
averaging over all test cases:
$$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$
where |T| is the number of test ratings. It is important to note
that Recall-at-N underestimates the computed recall and
precision with respect to true recall and precision. It must be
viewed as a lower bound of the recommender system’s
performance. This stems from the hypothesis that all 1000
random movies are non-relevant to user u.
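The ranking and hit-counting procedure described above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the usual definitions recall(N) = hits / |T| and precision(N) = recall(N) / N, it breaks ties in favor of the test item, and the function names and scores are hypothetical.

```python
def rank_of_test_item(test_score, random_scores):
    """1-based rank p of the test item among itself and the random items."""
    # Only items scoring strictly higher are placed ahead of the test item
    # (assumption: ties are resolved in favor of the test item).
    return 1 + sum(1 for s in random_scores if s > test_score)


def recall_precision_at_n(ranks, n):
    """Average hits over all test cases; `ranks` holds one rank p per test case."""
    hits = sum(1 for p in ranks if p <= n)
    recall = hits / len(ranks)
    precision = recall / n  # one relevant item per test case
    return recall, precision


# Hypothetical ranks for four 5-star test ratings.
ranks = [1, 3, 12, 1001]
recall, precision = recall_precision_at_n(ranks, n=5)
```

With N = 1001 every test case is a hit, so recall reaches 1 by construction.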
VI. DATA SET
A. NETFLIX DATA SET
The Netflix data set contains the movie viewing behavior of
480,189 users. It is a record of 100,480,507 ratings from 1-5,
distributed across 17,770 movies. It is used to construct
models that predict user ratings, based on previous ratings
without any other information about the users or movies.
Predictions have been traditionally scored against the true
ratings in terms of root mean squared error (RMSE), and the
goal is to reduce this error as much as possible.
VII. IMPLEMENTATION
A. DATA PREPROCESSING & GRAPH BUILDING
A graph can be thought of as a representation of a sparse
square matrix, where the dimensions are each of the nodes and
the non-zero entries are the weights of the edges connecting the
nodes. The Netflix dataset is not in a graph format and needs to
be transformed appropriately. In this paper, we explore two
ways to build a graph out of the user-item matrix, where each
non-zero entry is a rating between 1 and 5.
Covariance matrix: the first method takes the user-item
matrix and matrix-multiplies it with its transpose. The result is a
symmetric square matrix, where the non-zero entries represent
some measure of co-occurrence mutual information. This
matrix can then be represented as an undirected graph.
Depending on the order of the matrix multiplication, the
nodes will either represent movies or users. In this exploration
phase of Like2Vec, only graphs with movies as nodes were
studied. Embedding movies into dense vectors reduced the
model’s dimensionality by several orders of magnitude
compared to doing the same with the users.
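A minimal sketch of the first graph construction with movies as nodes is shown below. It computes the off-diagonal entries of the product of the transposed user-item matrix with itself without materializing the sparse matrix; the function name `movie_graph` and the toy ratings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def movie_graph(ratings):
    """Edge weights {(movie_a, movie_b): w} equal to the off-diagonal entries
    of M^T M, where M is the user-by-movie rating matrix."""
    weights = defaultdict(float)
    for movie_ratings in ratings.values():
        # Each user contributes r_ua * r_ub to the (a, b) entry.
        for a, b in combinations(sorted(movie_ratings), 2):
            weights[(a, b)] += movie_ratings[a] * movie_ratings[b]
    return dict(weights)


# Toy data: two users, three movies (ratings are made up).
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 4, "C": 1},
}
graph = movie_graph(ratings)
```

Because the matrix is symmetric, each undirected edge is stored once with its endpoints in sorted order.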
Log-likelihood matrix: In statistics, a log-likelihood ratio
test is a statistical test used to compare the goodness of fit of
two models, one of which is a special case of the other. The
test is based on the likelihood ratio, which expresses how
many times more likely the data are under one model than the
other.
We used this idea to compute a score to analyze counts of
events that occur together. The log-likelihood ratio estimates
how much more likely two items are to co-occur than
not. Its main advantage is that it corrects for
unbalanced occurrences of items. It is a co-occurrence metric
that is weighted by the global occurrence of an item. In this
way, obscure movies are not drowned out by other globally
popular items.
Another advantage is that it does not take into account the
ratings, which tend to be noisy because of their highly
subjective nature. Moreover, this approach can also be used in
other domains where ratings are not available, for example a
user’s product purchase history.
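The text does not state the exact formula used for the score. One common formulation for co-occurrence counts is Dunning's log-likelihood ratio (the G² statistic) on a 2×2 contingency table, sketched below under that assumption. Here k11 is the number of users who interacted with both movies, k12 and k21 the numbers who interacted with only one of them, and k22 the number who interacted with neither; all names are illustrative.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 co-occurrence table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(100, 1, 1, 100)
```

The score is near zero when the two movies co-occur about as often as chance predicts and grows as the association strengthens. No ratings are involved, only counts.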
B. RANDOM WALKS
Random walks are generated by picking a graph node at
random and traversing the network for 40 steps. The
probability of each next step is proportional to the weight on
the edges connected to the current node. The sequence
travelled is recorded and added to the “corpus”. This exact
same process is repeated multiple times for every node, in
order to fully explore the graph structure.
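The walk generation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is stored as a dictionary of weighted adjacency dictionaries, `length` counts the nodes in a walk, and the names `weighted_random_walk` and `build_corpus` are hypothetical.

```python
import random


def weighted_random_walk(graph, start, length, rng):
    """Walk up to `length` nodes; each step is drawn proportionally to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # isolated node: the walk ends early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, visiting nodes in random order."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(weighted_random_walk(graph, node, length, rng))
    return corpus


# Toy undirected movie graph with made-up edge weights.
graph = {
    "A": {"B": 2.0, "C": 1.0},
    "B": {"A": 2.0, "C": 2.0},
    "C": {"A": 1.0, "B": 2.0},
}
corpus = build_corpus(graph, walks_per_node=3, length=40)
```

Each walk in `corpus` plays the role of one sentence for the language model.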
C. MOVIE EMBEDDING
Once the corpus of random walks is generated, it can then
be passed into the SkipGram language model. With an average
sliding window size of 6 items, the movie vectors are trained
multiple times with different embedding sizes.
D. EVALUATION
The evaluation is performed on two evaluation metrics:
the preferred metric for recommender systems, Recall-at-N,
and the traditional RMSE, in order to compare the results with
a collaborative filtering baseline.
For the calculation of RMSE, the ratings first need to be
computed. For every movie in the test set, we picked the top-k
movies seen by the respective user that are most similar to the
movie being tested. The ratings are calculated by computing
either a naïve average or a weighted average (based on
similarity scores) of the ratings of the top-k movies.
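The two prediction schemes and the RMSE computation can be sketched as below. This is an illustrative sketch: it assumes non-negative similarity scores, falls back to the naïve average when the similarities sum to zero, and uses hypothetical function names and data.

```python
import math


def predict_rating(neighbors, k, weighted=True):
    """Predict a rating from (similarity, rating) pairs of movies the user rated."""
    top_k = sorted(neighbors, key=lambda sr: sr[0], reverse=True)[:k]
    if weighted:
        total = sum(s for s, _ in top_k)
        if total > 0:
            # Similarity-weighted average of the top-k ratings.
            return sum(s * r for s, r in top_k) / total
    # Naive average: every top-k neighbor counts equally.
    return sum(r for _, r in top_k) / len(top_k)


def rmse(predicted, actual):
    """Root mean squared error between two equally long rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Made-up (similarity, rating) pairs for one test movie.
neighbors = [(0.9, 5), (0.1, 1), (0.5, 3)]
naive = predict_rating(neighbors, k=2, weighted=False)
weighted = predict_rating(neighbors, k=2, weighted=True)
error = rmse([naive, 2.0], [5, 2])
```

The weighted scheme pulls the prediction toward the ratings of the most similar movies, while the naïve scheme treats all k neighbors alike.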
VIII. RESULTS
Figure 5: Recall-at-N (%)
A. RECALL-AT-N
Like2Vec’s Recall-at-N score was evaluated at different
values of N, where N is the number of top-N recommendations
needed to surface the test set movie in question. Such analysis
gives an idea of the range of performance the recommender
system is capable of. Visualizing the full behavior of the model
allows for a better comparison with other
systems. In a commercial setting, an in-depth understanding of the
model permits optimal tuning of the algorithm.
Figure 6: Recall-at-N (Hit Freq.)
Figure 5 plots the recall score for both Like2Vec and a
baseline recommender. The baseline is essentially the same
model, but with the DeepWalk featurization removed: its
recall score is computed directly from the log-likelihood ratio
as the similarity metric. Like2Vec clearly outperforms the
baseline for small N. This is favorable behavior for a
recommender system. It means that Like2Vec is able to retrieve
highly relevant five-star movies very early in its ranked list.
There is a cross-over around rank 7, but this is acceptable
considering that users rarely check out recommendations
past rank 5 in most commercial domains.
Figure 6 plots the same results, but with recall as a
frequency count instead of a normalized percentage. Here it is
even clearer what makes Like2Vec unique. If the
recommender system could suggest only one movie, it would
outperform the baseline by more than a factor of two. Like2Vec is
optimized to prioritize early suggestion of great movies at the
expense of poor long-tail behavior. This is highlighted here by
the fact that the curve is close to the y-axis.
Figure 7: Grid-Search on Recall Score
Hyper-parameter tuning through cross-validation was
performed on the embedding size and the number of random
walks generated per node. Figure 7 plots the grid search results
for the cross-over recall score. The cross-over recall score is
defined as the number of top-N recommendations possible
before Like2Vec stops outperforming the log-likelihood
baseline. Vector embeddings of 300 and 500 dimensions
both gave the best cross-over recall score, although the larger
embedding was able to achieve this score with fewer random
walks per node.
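One way to read the cross-over recall score is as the largest N up to which the model's recall stays strictly above the baseline's. The sketch below follows that reading, which is an interpretation of the definition given in the text; the function name and the recall curves are hypothetical.

```python
def crossover_n(model_recall, baseline_recall):
    """Largest N such that the model beats the baseline at every N' <= N.

    Both lists hold recall values for N = 1, 2, 3, ... in order.
    """
    n = 0
    for model, baseline in zip(model_recall, baseline_recall):
        if model <= baseline:
            break  # the model no longer outperforms the baseline
        n += 1
    return n


# Made-up recall curves for N = 1..4.
model_curve = [0.30, 0.40, 0.45, 0.50]
baseline_curve = [0.10, 0.30, 0.50, 0.60]
score = crossover_n(model_curve, baseline_curve)
```

A higher score means the model keeps its advantage deeper into the ranked list.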
B. RMSE
Figure 8: Baseline RMSE Score

The performance of Like2Vec was also evaluated with
RMSE on the predicted ratings. Predictions were made
with both naïve and weighted averaging. Figure 8 plots the results
for the baseline and Figure 9 for Like2Vec.
Figure 9: Like2Vec RMSE Score
Like2Vec outperforms the baseline in both prediction
schemes. Although Like2Vec is optimized for recall
performance, it can still produce good predictions in an RMSE
setting. The baseline suffered a significant performance drop
when not using a weighted average for prediction. On the other
hand, Like2Vec performed almost identically in both
situations. Such behavior highlights the quality of the movie
embeddings. A projection of movie vectors along all the
embedded dimensions is enough to give a rich and accurate
similarity metric between the movies, allowing for a robust
rating prediction.
Figure 10: Grid-Search on RMSE Score
Hyper-parameter tuning through cross-validation was also
performed for the RMSE evaluation metric. Figure 10 plots the
grid search results. Note that the optimal performance of the model
is achieved at very different hyper-parameter configurations
for the two metrics.
This should not come as a surprise since, as discussed earlier,
RMSE and Recall-at-N do not test the same behavior.
Therefore, tuning a recommender system with the wrong metric
will make for a sub-optimal model.
C. COLLABORATIVE FILTERING
Another grid search was performed for a second, more
challenging baseline. ALS collaborative filtering is the go-to
model for most recommender systems. Here again, Like2Vec
outperforms ALS for all hyper-parameter configurations
tested.
IX. CONCLUSION
In this paper, we introduce Like2Vec, a novel approach to
recommender systems. We combine multiple machine learning
techniques to transform a high dimensional dataset into a
social graph. Community information was extracted from the
graph by borrowing ideas from neural language models,
producing dense latent representations. These latent
representations encode rich features in a lower dimensional
continuous vector space.
Like2Vec achieved high rating prediction performance as
well as promising Recall-at-N results. It was shown that
Recall-at-N can force the model to prioritize surfacing highly
relevant content early in the ranking. Such behavior would be
penalized in an RMSE evaluation setting, but with Recall-at-N
it is instead promoted at the expense of the long-tail
performance.
References
[1] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using
pagerank vectors. In Foundations of Computer Science, 2006. FOCS’06.
47th Annual IEEE Symposium on, pages 475–486. IEEE, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A
review and new perspectives. 2013.
[3] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic
language model. Journal of Machine Learning Research, 3:1137–1155,
2003.
[4] L. Bottou. Stochastic gradient learning in neural networks. In
Proceedings of Neuro-Nîmes 91, Nîmes, France, 1991. EC2.
[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey.
ACM Computing Surveys (CSUR), 41(3):15, 2009.
[6] R. Collobert and J. Weston. A unified architecture for natural language
processing: Deep neural networks with multitask learning. In
Proceedings of the 25th international conference on Machine learning,
pages 160–167. ACM, 2008.
[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-
trained deep neural networks for large-vocabulary speech recognition.
Audio, Speech, and Language Processing, IEEE Transactions on,
20(1):30–42, 2012.
[8] R. Bambini, P. Cremonesi, and R. Turrin. Recommender Systems
Handbook, chapter A Recommender System for an IPTV Service
Provider: a Real Large-Scale Production Environment. Springer, 2010.
[9] P. Cremonesi, E. Lentini, M. Matteucci, and R. Turrin. An evaluation
methodology for recommender systems. 4th Int. Conf. on Automated
Solutions for Cross Media Content and Multi-channel Distribution,
pages 224–231, Nov 2008.
[10] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A
review and new perspectives. 2013.
