Like2Vec: A Graph Embedding Technique for
Recommender Systems
Marvin Bertin – marvin.bertin@gmail.com
Michael Ulin - michaelulin@gmail.com
Mike Tamir – mntamir@gmail.com
Jacob Baumbach - jwbaum91@gmail.com
David Ott – davidott4@gmail.com
Data Science Master Program
GalvanizeU
San Francisco, USA
Abstract— User behavior datasets, such as the Netflix dataset, are
typically sparse and high dimensional. For this reason,
recommender systems tend to be based on matrix factorization
compression algorithms, instead of traditional statistical models.
Like2Vec proposes a novel approach to transform a sparse dataset
into a graph representation, followed by a neural embedding of the
nodes into a rich dense latent feature space. The distance between
movie vectors in this latent space can be used to compute a
similarity metric between movies. These vector projections allow
the model to surface and recommend highly relevant movies for the
respective users. We show that Like2Vec outperforms standard
baselines on both the RMSE and Recall-at-N evaluation metrics.
Different evaluation metrics lead to different hyper-parameter
configurations.
We argue Recall-at-N is a superior metric for evaluating
recommender systems, since it provides a better assessment of the
quality of top-N recommendations.
Keywords—recommender system; neural networks; graphical
model; neural embedding; log-likelihood ratio; latent space
I. INTRODUCTION
Statistical models are among the most popular
techniques in machine learning for supervised predictive tasks.
Logistic regression and linear regression are still among the
most commonly used algorithms today. They have proven to
be robust and powerful models across many domains.
However, there is a type of data under which such
models regularly under-perform. When the data is high
dimensional and sparse, machine learning models suffer from
what is often referred to as the curse of dimensionality.
When dimensionality increases, the volume of the
space increases so fast that the available data become sparse.
This sparsity is problematic for any method that requires
statistical significance. In order to obtain a statistically sound
and reliable result, the amount of data needed to support the
result often grows exponentially with the dimensionality.
Figure 1: Input graph on the left is embedded into continuous
vector space on the right.
Moreover, distances between data points lose their
significance, since all objects appear to be sparse and
dissimilar in many ways, which hinders statistical models
from learning meaningful patterns.
Unstructured text is a great example where statistical
models struggle. A text corpus can be represented as a
document-term frequency matrix. Such a matrix typically has
dimensions on the order of 10^5 to 10^6 and is highly sparse.
Another example is user behavior data, where both
the number of users and the number of items can be very large,
but any given user will only ever interact with a tiny fraction
of the item set. Amazon purchase history and Netflix viewing
history are examples of this type of data.
There are many ways to deal with such data. One is to
compress it into a lower dimensional, rich feature space, where
the drawbacks of the curse of dimensionality are
mitigated, allowing traditional statistical models to function
adequately.
In text, a very popular compression model is
Word2Vec, a neural network word embedding algorithm. It is
a type of neural language model that has been used to capture
the semantic and syntactic structure of human language [3],
and even logical analogies [4].
In the context of user-item matrices, a common
compression algorithm is matrix factorization. There exist
many different types of matrix factorization algorithms, such as
PCA, SVD and non-negative matrix factorization.
These techniques decompose a high dimensional
matrix into its lower dimensional components, while
maintaining the underlying signal in the data. This
transformed space is sometimes referred to as the latent topic
space. It contains rich dense features that were
latent (hidden) in the original feature space.
Like2Vec takes inspiration from these compression
methods and combines them with a graph representation to
build a new kind of recommender system, evaluated on the
well-known Netflix dataset.
II. RECOMMENDER ALGORITHMS
Figure 2: Matrix Factorization
A. COLLABORATIVE ALGORITHMS
Most recommender systems are based on collaborative
filtering, where recommendations rely only on past user
behavior. In the Netflix dataset, used in our evaluation, such
information is in the form of rated user viewing history. There
are two primary approaches to collaborative filtering: the
neighborhood approach and the latent factor approach.
Neighborhood models represent the most common
approach. They are based on the similarity among either users
or items. For instance, two users are similar if they have
rated the same set of items similarly. Similarity between items
can be deduced with the same logic.
Latent factor models represent users and items as vectors
in the same ‘latent factor’ space by means of a reduced
number of hidden factors. In such a space, users and items are
directly comparable: the rating of user u on item i is predicted
by the proximity, for example the inner product, between the
related latent factor vectors.
One of the unique characteristics of the Like2Vec model is
that it incorporates elements of both approaches. The latent
factors are encoded in the dimensions of the dense vector
embedding and the neighborhood information is a by-product
of the SkipGram movie representation.
III. GRAPH EMBEDDING TECHNIQUE
The sparsity of a graph representation is both a strength
and a weakness. Sparsity enables the design of efficient
discrete algorithms, but can make it harder to generalize in
statistical learning. Machine learning applications on graphs,
such as the movie recommendation task considered here [6], must
be able to deal with this sparsity.
Figure 3: DeepWalk
A. DEEPWALK
DeepWalk is an approach for learning latent
representations of vertices in a graph. These latent
representations extract meaningful structural information
and encode it in a continuous vector space, which is then
easily exploited by statistical models. DeepWalk can be
viewed as a generalization of the Word2Vec embedding
representation.
Word2vec language models are shallow neural networks that,
much like an auto-encoder, learn a compressed representation of
their input. These models have proven particularly useful at
compressing high dimensional sparse representations of text into
dense low dimensional vectors. From just sequences of words in a
corpus, training produces features in an unsupervised manner
through back-propagation.
DeepWalk goes a step further than word2vec and
generates its own “corpus”. The graph is explored by a series
of truncated random walks. By treating the walks as the
equivalent of sentences and the nodes as words, the algorithm is
able to generate corpora of arbitrary size. This synthetic
language captures the community information present in the
graph. Traditional neural language models can then be used to
extract rich movie embeddings from the corpus.
DeepWalk’s latent representations have been evaluated on
several multi-label network classification tasks for social
networks such as BlogCatalog, Flickr, and YouTube. Results
show that DeepWalk outperforms challenging baselines,
especially in the presence of missing information. DeepWalk’s
representations can provide F1 scores up to 10% higher than
competing methods when labeled data is sparse. In some
experiments, DeepWalk’s representations are able to
outperform all baseline methods while using 60% less training
data.
DeepWalk is also scalable. It is an online learning
algorithm which builds useful incremental results, and is
trivially parallelizable. These qualities make it suitable for
real-world tasks such as recommendation on the Netflix dataset,
which is a challenging high dimensional sparse dataset.
IV. LANGUAGE MODEL
Figure 4: SkipGram Language Model
The goal of language modeling is to estimate the
likelihood of a specific sequence of words appearing in a
corpus. More formally, given a sequence of words
$W_1^n = (w_0, w_1, \ldots, w_n)$,
where $w_i \in V$ ($V$ is the vocabulary), we would like to
maximize
$\Pr(w_n \mid w_0, w_1, \ldots, w_{n-1})$
over all the training corpus.
DeepWalk generalizes such language modeling by
exploring the graph through a stream of short random walks.
These walks can be thought of as short sentences and phrases in
a special language. The direct analog is to estimate the
likelihood of observing vertex v_i given all the previous
vertices visited so far in the random walk.
A. SKIPGRAM
The language model used in like2vec is the SkipGram
algorithm. It maximizes the co-occurrence probability among
the words that appear within a window, w, in a sentence [7].
The SkipGram language model has the following
characteristics:
• Instead of using the context to predict a missing
word, it uses one word to predict the context.
• The context is composed of the words appearing
on the right side of the given word as well as on the left
side.
• It removes the ordering constraint on the problem.
• The model is required to maximize the
probability of any word appearing in the context
without the knowledge of its offset from the given
word.
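The characteristics listed above can be illustrated with a minimal sketch of how SkipGram training pairs might be generated from one sentence (or random walk). The function name `skipgram_pairs` and the toy movie IDs are illustrative assumptions, not part of the original implementation.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for all tokens within `window` positions.

    Hypothetical helper: the context covers both the left and the right side
    of the center token, and the offset between the two is discarded.
    """
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Only the co-occurrence is kept, not the position (j - i).
                pairs.append((center, sentence[j]))
    return pairs


# Toy walk over four made-up movie IDs.
walk = ["m1", "m2", "m3", "m4"]
pairs = skipgram_pairs(walk, window=1)
```

Each pair asks the model to predict one context item from the center item alone, which is what removes the ordering constraint.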
By generating synthetic sentences from a stream of
random walks and using the corpus as input to the SkipGram
language model, we are able to build representations that
capture the shared similarities in local graph structure between
vertices. Vertices which have similar neighborhoods will
acquire similar representations, allowing generalization on
machine learning tasks.
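As a sketch, the SkipGram objective over random walks can be written in the notation commonly used for DeepWalk. The symbols here are assumptions introduced for illustration: $\Phi$ maps a vertex to its embedding vector and $w$ is the window size. The second line additionally assumes that context vertices are conditionally independent given the center vertex.

```latex
\min_{\Phi} \; -\log \Pr\bigl(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \;\big|\; \Phi(v_i)\bigr)
\;=\;
\min_{\Phi} \; -\sum_{\substack{j = i-w \\ j \neq i}}^{i+w} \log \Pr\bigl(v_j \mid \Phi(v_i)\bigr)
```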
V. EVALUATION METRICS
The goal of the recommender system is to surface a
list of items which are the most relevant or appealing to a
specific user. This is referred to as a top-N recommendation
task. A common practice in industry and academia is to
evaluate recommender systems’ performance through error
metrics such as the RMSE (root mean squared error), which
capture the average error between the actual ratings and the
ratings predicted by the model. However, such an evaluation
metric is not a natural fit for evaluating a top-N
recommendation task.
An extensive evaluation of several state-of-the-art
recommender algorithms suggests that algorithms optimized
for minimizing RMSE do not necessarily perform as expected
on the top-N task, and that RMSE improvements often do not
translate into accuracy improvements.
Direct evaluation of top-N performance must be
accomplished by means of alternative methodologies based on
accuracy metrics, such as recall and precision. Accordingly,
this experiment will evaluate the Like2Vec model with both
the classical RMSE metric and the more appropriate method
called Recall-at-N.
A. RECALL-AT-N
The Recall-at-N evaluation metric attempts to directly assess
the quality of top-N recommendations, in a way that RMSE
cannot.
The dataset with known ratings is first split into two
subsets: training set M and test set T. The model is trained
with the ratings in M and then evaluated on the test set T. A
special characteristic of the test set is that it contains only
5-star ratings. The goal is to construct a test set that only
contains highly relevant items for the respective users.
In order to measure recall and precision, we perform the
following steps for each movie i rated 5 stars by user u in T:
i. From user u's viewing history, surface the movie that
is most similar to movie i, based on the cosine
similarity of the embedded vectors.
ii. Randomly select 1000 additional movies unrated by
user u. The assumption is that most of these items
will not be of interest to user u.
iii. Compute the same similarity score for each of the
additional 1000 movies.
iv. Generate a ranked list by ordering all the 1001
movies according to their scores. Let p
denote the rank of the test movie i within this list.
The best result corresponds to the case where the test
movie i precedes all the random items (i.e., p = 1).
v. We form a top-N recommendation list by picking the
N top ranked items from the list. If p ≤ N we have a
hit (i.e., the test item i is recommended to the user).
Otherwise we have a miss. The chance of a hit increases
with N. When N = 1001 we always have a hit.
The computation of recall and precision proceeds as
follows. The overall recall and precision are defined by
averaging over all test cases:
$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|}$
$\mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$
where |T| is the number of test ratings. It is important to note
that Recall-at-N underestimates the computed recall and
precision with respect to true recall and precision. It must be
viewed as a lower bound of the recommender system’s
performance. This stems from the hypothesis that all 1000
random movies are non-relevant to user u.
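The procedure above can be sketched in a few lines of Python. The helper names and the tie-breaking rule (ties count against the test movie) are assumptions made for illustration, and the ranks in the example are made-up values rather than results from the paper.

```python
def rank_of_test_item(test_score, random_scores):
    """Rank p (1 = best) of the test movie among itself and the random movies.

    Assumed convention: a random movie with an equal score is ranked ahead
    of the test movie, which gives a pessimistic rank.
    """
    return 1 + sum(1 for s in random_scores if s >= test_score)


def recall_at_n(ranks, n):
    """Fraction of test cases whose test movie lands in the top-N list."""
    hits = sum(1 for p in ranks if p <= n)
    return hits / len(ranks)


def precision_at_n(ranks, n):
    """Each test case has exactly one relevant item, so precision = recall / N."""
    return recall_at_n(ranks, n) / n


# One rank per (user, 5-star movie) test case; illustrative values only.
ranks = [1, 3, 12, 2]
recall_1 = recall_at_n(ranks, 1)
recall_3 = recall_at_n(ranks, 3)
precision_3 = precision_at_n(ranks, 3)
```

Because every test case contributes exactly one relevant item, precision carries no information beyond recall here; it is recall scaled by 1/N.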
VI. DATA SET
A. NETFLIX DATA SET
The Netflix data set contains the movie viewing behavior of
480,189 users. It is a record of 100,480,507 ratings on a scale
of 1 to 5, distributed across 17,770 movies. It is used to construct
models that predict user ratings, based on previous ratings
without any other information about the users or movies.
Predictions have been traditionally scored against the true
ratings in terms of root mean squared error (RMSE), and the
goal is to reduce this error as much as possible.
VII. IMPLEMENTATION
A. DATA PREPROCESSING & GRAPH BUILDING
A graph can be thought of as a representation of a sparse
square matrix, where each dimension indexes the nodes and
the non-zero entries are the weights of the edges connecting the
nodes. The Netflix dataset is not in a graph format and needs to
be transformed appropriately. In this paper, we explore two
ways to build a graph out of the user-item matrix, where each
non-zero entry is a rating between 1 and 5.
Covariance matrix: the first method takes the user-item
matrix and multiplies it by its transpose. The result is a
symmetric square matrix, where the non-zero entries represent
some measure of co-occurrence, or mutual information, between
the corresponding pair of nodes. This matrix can then be
represented as an undirected graph.
Depending on the order of the matrix multiplication, the
nodes will either represent movies or users. In this exploration
phase of Like2Vec, only graphs with movies as nodes were
studied. Embedding movies into dense vectors reduced the
model’s dimensionality by several orders of magnitude
compared to doing the same with the users.
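A minimal sketch of the first construction, with movies as nodes, is given below. It computes the off-diagonal entries of the product of the transposed user-item matrix with itself without materializing the matrix; the dictionary-based layout, the function name, and the toy ratings are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def item_cooccurrence(ratings):
    """Edge weights of the movie-movie graph.

    ratings: {user: {movie: rating}} (hypothetical sparse layout).
    Returns {(movie_a, movie_b): weight} with movie_a < movie_b, where the
    weight is the sum over users of rating_a * rating_b.
    """
    weights = defaultdict(float)
    for user_ratings in ratings.values():
        # Every pair of movies rated by the same user contributes one term.
        for a, b in combinations(sorted(user_ratings), 2):
            weights[(a, b)] += user_ratings[a] * user_ratings[b]
    return dict(weights)


# Toy data: two users, three movies.
ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 3, "B": 2, "C": 1},
}
edges = item_cooccurrence(ratings)
```

Swapping the roles of users and movies in the input would yield the user-user graph instead, which corresponds to reversing the order of the matrix multiplication.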
Log-likelihood matrix: In statistics, a log likelihood ratio
test is a statistical test used to compare the goodness of fit of
two models, one of which is a special case of the other. The
test is based on the likelihood ratio, which expresses how
many times more likely the data are under one model than the
other.
We used this idea to compute a score for counts of
events that occur together. The log-likelihood ratio estimates
how many times more likely two items are to co-occur as
opposed to not. Its main advantage is that it corrects for
unbalanced occurrences of items: it is a co-occurrence metric
that is weighted by the global occurrence of an item. In this
way, obscure movies are not drowned out by globally
popular items.
Another advantage is that it does not take into account the
ratings, which tend to be a noisy signal because of their highly
subjective nature. Moreover, this approach can also be used in
other domains where ratings are not available, for example a
user’s product purchase history.
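The paper does not spell out the exact formula used. One common choice for co-occurrence counts is Dunning's G² statistic on a 2x2 contingency table, sketched below under that assumption. The function names and the toy user sets are hypothetical.

```python
import math


def llr(k11, k12, k21, k22):
    """Dunning's G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: users who watched both movies     k12: movie A but not movie B
    k21: movie B but not movie A           k22: neither movie
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero count contributes nothing (limit of k * log k)
            g2 += k * math.log(k * n / (rows[r] * cols[c]))
    return 2.0 * g2


def llr_for_pair(users_a, users_b, n_users):
    """Score a movie pair from the sets of users who watched each movie."""
    k11 = len(users_a & users_b)
    k12 = len(users_a - users_b)
    k21 = len(users_b - users_a)
    k22 = n_users - k11 - k12 - k21
    return llr(k11, k12, k21, k22)


# Toy example: 3 of 10 users watched both movies.
score = llr_for_pair({1, 2, 3, 4}, {2, 3, 4, 5}, n_users=10)
```

The score is zero when the two movies co-occur exactly as often as independence would predict, and it grows as the observed counts depart from that expectation. Only the presence of a rating is used, never its value.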
B. RANDOM WALKS
Random walks are generated by picking a graph node at
random and traversing the network for 40 steps. The
probability of each next step is proportional to the weight on
the edges connected to the current node. The sequence
travelled is recorded and added to the “corpus”. This exact
same process is repeated multiple times for every node, in
order to fully explore the graph structure.
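The walk generation described above might look like the following sketch. The adjacency-dictionary layout, the function names, the number of walks per node, and the seed are illustrative assumptions; only the 40-step length and the weight-proportional transitions come from the text.

```python
import random


def random_walk(graph, start, steps, rng):
    """Weighted random walk: the next node is drawn with probability
    proportional to the weight of the edge leaving the current node.

    graph: {node: {neighbor: weight}} (hypothetical layout).
    """
    walk = [start]
    for _ in range(steps):
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break  # dead end: stop the walk early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph, walks_per_node=10, steps=40, seed=0):
    """Repeat the walk several times from every node to build the corpus."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # visit the start nodes in random order
        for node in nodes:
            corpus.append(random_walk(graph, node, steps, rng))
    return corpus


# Toy undirected graph with weighted edges.
graph = {
    "A": {"B": 1.0},
    "B": {"A": 2.0, "C": 1.0},
    "C": {"B": 1.0},
}
corpus = build_corpus(graph, walks_per_node=2, steps=5, seed=0)
```

Each walk plays the role of one sentence in the synthetic corpus, and each movie ID plays the role of one word.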
C. MOVIE EMBEDDING
Once the corpus of random walks is generated, it can then
be passed into the SkipGram language model. With an average
sliding window size of 6 items, the movie vectors are trained
multiple times with different embedding sizes.
D. EVALUATION
The evaluation is performed with two metrics:
Recall-at-N, the preferred metric for recommender systems,
and the traditional RMSE, which allows the results to be
compared with a collaborative filtering baseline.
For the calculation of RMSE, the ratings first need to be
predicted. For every movie in the test set, we pick the top-k
movies seen by the respective user that are most similar to the
movie being tested. The rating is calculated as
either a naïve average or a weighted average (based on
similarity scores) of the ratings of these top-k movies.
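A minimal sketch of this prediction scheme follows. The function names, the clamping of negative similarities to zero, the fallback to a plain mean, and the toy two-dimensional vectors are all assumptions made for illustration. The sketch also assumes a non-empty viewing history.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_rating(target_vec, history, k=3, weighted=True):
    """Predict a rating from the k most similar movies the user has rated.

    history: non-empty list of (movie_vector, rating) pairs.
    """
    scored = sorted(
        ((cosine(target_vec, vec), rating) for vec, rating in history),
        reverse=True,
    )[:k]
    if weighted:
        # Assumed convention: negative similarities get zero weight.
        weights = [max(sim, 0.0) for sim, _ in scored]
        total = sum(weights)
        if total > 0:
            return sum(w * r for w, (_, r) in zip(weights, scored)) / total
    # Naive average (also the fallback when all weights are zero).
    return sum(r for _, r in scored) / len(scored)


def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Toy history with 2-dimensional embeddings.
history = [([1.0, 0.0], 5.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
naive = predict_rating([1.0, 0.0], history, k=2, weighted=False)
weighted = predict_rating([1.0, 0.0], history, k=2, weighted=True)
```

In the toy example the weighted prediction is pulled toward the rating of the most similar movie, while the naive average treats both neighbours equally.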
VIII. RESULTS
Figure 5: Recall-at-N (%)
A. RECALL-AT-N
Like2Vec’s Recall-at-N score was evaluated at different
values of N, where N is the number of top-N recommendations
needed to surface the test set movie in question. Such analysis
gives an idea of the range of performance the recommender
system is capable of. The full behavior of the model is
visualized, which allows for a better comparison with other
systems. In a commercial setting, an in-depth understanding of the
model permits optimal tuning of the algorithm.
Figure 6: Recall-at-N (Hit Freq.)
Figure 5 plots the recall score for both Like2Vec and a
baseline recommender. The baseline is essentially the same
model, but with the DeepWalk featurization removed: its
recall score is computed directly with the log-likelihood ratio
as the similarity metric. Like2Vec clearly outperforms the
baseline for small N. This is a favorable behavior for a
recommender system. It means that Like2Vec is able to retrieve
highly relevant five-star movies very early in its ranked list.
There is a cross-over around rank 7, but this is acceptable
considering that most users rarely check recommendations
past rank 5 in most commercial domains.
Figure 6 plots the same results, but with recall expressed as a
frequency count instead of a normalized percentage. Now it is
even clearer what makes Like2Vec unique. If the
recommender system could suggest only one movie, Like2Vec would
outperform the baseline by more than a factor of two. Like2Vec is
optimized to prioritize early suggestion of great movies at the
expense of poor long-tail behavior. This is highlighted here by
the fact that the curve is close to the y-axis.
Figure 7: Grid-Search on Recall Score
Hyper-parameter tuning through cross-validation was
performed on the embedding size and the number of random
walks generated per node. Figure 7 plots the grid search results
for the cross-over recall score. The cross-over recall score is
defined as the number of top-N recommendations possible
before Like2Vec stops outperforming the log-likelihood
baseline. Vector embeddings of both 300 and 500 dimensions
gave the best cross-over recall score, although the larger
embedding was able to achieve this score with fewer random
walks per node.
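The cross-over recall score defined above can be computed from two recall curves as in the sketch below. The function name is hypothetical and the recall values are made up for illustration; they are not the paper's results.

```python
def crossover_n(model_recall, baseline_recall):
    """Largest N such that the model beats the baseline at every rank up to N.

    Both lists are indexed so that element 0 holds the recall at N = 1.
    """
    n = 0
    for m, b in zip(model_recall, baseline_recall):
        if m <= b:
            break  # the model no longer outperforms the baseline
        n += 1
    return n


# Illustrative recall curves for N = 1..4.
model = [0.10, 0.15, 0.18, 0.20]
baseline = [0.04, 0.09, 0.17, 0.22]
score = crossover_n(model, baseline)
```

A grid search would compute this score once per hyper-parameter configuration and keep the configuration with the largest value.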
B. RMSE
Figure 8: Baseline RMSE Score
The performance of Like2Vec was also evaluated with
RMSE on the predicted ratings. Predictions were made
with both a naïve and a weighted average. Figure 8 plots the
results for the baseline and Figure 9 for Like2Vec.
Figure 9: Like2Vec RMSE Score
Like2Vec outperforms the baseline in both prediction
schemes. Although Like2Vec is optimized for recall
performance, it can still produce accurate predictions in an RMSE
setting. The baseline suffered a significant performance drop
when not using a weighted average for prediction. On the other
hand, Like2Vec performed almost identically in both
situations. Such behavior highlights the quality of the movie
embeddings. A projection of movie vectors along all the
embedded dimensions is enough to give a rich and accurate
similarity metric between the movies, allowing for a robust
rating prediction.
Figure 10: Grid-Search on RMSE Score
Hyper-parameter tuning through cross-validation was also
performed for the RMSE evaluation metric. Figure 10 plots the
grid search results. Note that the optimal performance of the
model is achieved at very different hyper-parameter
configurations for the two metrics. This should not come as a
surprise since, as discussed earlier,
RMSE and Recall-at-N do not test the same behavior.
Therefore, tuning a recommender system with the wrong metric
will produce a sub-optimal model.
C. COLLABORATIVE FILTERING
Another grid search was performed against a second, more
challenging baseline. ALS collaborative filtering is the go-to
model for most recommender systems. Here again, Like2Vec
outperforms ALS for all hyper-parameter configurations
tested.
IX. CONCLUSION
In this paper, we introduce Like2Vec, a novel approach to
recommender systems. We combine multiple machine learning
techniques to transform a high dimensional dataset into a
social graph. Community information was extracted from the
graph by borrowing ideas from neural language models,
producing dense latent representations. These latent
representations encode rich features in a lower dimensional
continuous vector space.
Like2Vec resulted in high rating prediction performance, as
well as promising Recall-at-N results. It was shown that
Recall-at-N can force the model to prioritize surfacing highly
relevant content early in the ranking. Such behavior would be
penalized in an RMSE evaluation setting, but with Recall-at-N
it is instead promoted at the expense of long-tail
performance.
References
[1] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using
pagerank vectors. In Foundations of Computer Science, 2006. FOCS’06.
47th Annual IEEE Symposium on, pages 475–486. IEEE, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A
review and new perspectives. 2013.
[3] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic
language model. Journal of Machine Learning Research, 3:1137–1155,
2003.
[4] L. Bottou. Stochastic gradient learning in neural networks. In
Proceedings of Neuro-Nîmes 91, Nîmes, France, 1991. EC2.
[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey.
ACM Computing Surveys (CSUR), 41(3):15, 2009.
[6] R. Collobert and J. Weston. A unified architecture for natural language
processing: Deep neural networks with multitask learning. In
Proceedings of the 25th international conference on Machine learning,
pages 160–167. ACM, 2008.
[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-
trained deep neural networks for large-vocabulary speech recognition.
Audio, Speech, and Language Processing, IEEE Transactions on,
20(1):30–42, 2012.
[8] R. Bambini, P. Cremonesi, and R. Turrin. Recommender Systems
Handbook, chapter A Recommender System for an IPTV Service
Provider: a Real Large-Scale Production Environment. Springer, 2010.
[9] P. Cremonesi, E. Lentini, M. Matteucci, and R. Turrin. An evaluation
methodology for recommender systems. 4th Int. Conf. on Automated
Solutions for Cross Media Content and Multi-channel Distribution,
pages 224–231, Nov 2008.

More Related Content

What's hot

The Download: Tech Talks by the HPCC Systems Community, Episode 16
The Download: Tech Talks by the HPCC Systems Community, Episode 16The Download: Tech Talks by the HPCC Systems Community, Episode 16
The Download: Tech Talks by the HPCC Systems Community, Episode 16
HPCC Systems
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation
1crore projects
 
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
Dawen Liang
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463
IJRAT
 

What's hot (20)

The Download: Tech Talks by the HPCC Systems Community, Episode 16
The Download: Tech Talks by the HPCC Systems Community, Episode 16The Download: Tech Talks by the HPCC Systems Community, Episode 16
The Download: Tech Talks by the HPCC Systems Community, Episode 16
 
Machine learning with graph
Machine learning with graphMachine learning with graph
Machine learning with graph
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
A Deep Learning Model to Predict Congressional Roll Call Votes from Legislati...
A Deep Learning Model to Predict Congressional Roll Call Votes from Legislati...A Deep Learning Model to Predict Congressional Roll Call Votes from Legislati...
A Deep Learning Model to Predict Congressional Roll Call Votes from Legislati...
 
Svv
SvvSvv
Svv
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Using transfer learning for video popularity prediction
Using transfer learning for video popularity predictionUsing transfer learning for video popularity prediction
Using transfer learning for video popularity prediction
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation
 
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
Factorization Meets the Item Embedding: Regularizing Matrix Factorization wit...
 
ML Interpretability Inside Out
ML Interpretability Inside OutML Interpretability Inside Out
ML Interpretability Inside Out
 
Human in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AIHuman in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AI
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Communication systems-theory-for-undergraduate-students-using-matlab
Communication systems-theory-for-undergraduate-students-using-matlabCommunication systems-theory-for-undergraduate-students-using-matlab
Communication systems-theory-for-undergraduate-students-using-matlab
 
Odsc 2019 entity_reputation_knowledge_graph
Odsc 2019 entity_reputation_knowledge_graphOdsc 2019 entity_reputation_knowledge_graph
Odsc 2019 entity_reputation_knowledge_graph
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463
 
Hierarchical p2 p
Hierarchical p2 pHierarchical p2 p
Hierarchical p2 p
 
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
 

Similar to Marvin_Capstone

Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstracts
butest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
butest
 
$$ Using statistics to search and annotate pictures an evaluation of semantic...
$$ Using statistics to search and annotate pictures an evaluation of semantic...$$ Using statistics to search and annotate pictures an evaluation of semantic...
$$ Using statistics to search and annotate pictures an evaluation of semantic...
mhmt82
 
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking
Integrated Hidden Markov Model and Kalman Filter for Online Object TrackingIntegrated Hidden Markov Model and Kalman Filter for Online Object Tracking
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking
ijsrd.com
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ijceronline
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - Copy
AMIT KUMAR
 

Similar to Marvin_Capstone (20)

2307.09793.pdf
2307.09793.pdf2307.09793.pdf
2307.09793.pdf
 
Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstracts
 
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
 
SHAHBAZ_TECHNICAL_SEMINAR.docx

Marvin_Capstone

Like2Vec: A Graph Embedding Technique for Recommender Systems

Marvin Bertin - marvin.bertin@gmail.com
Michael Ulin - michaelulin@gmail.com
Mike Tamir - mntamir@gmail.com
Jacob Baumbach - jwbaum91@gmail.com
David Ott - davidott4@gmail.com
Data Science Master Program
GalvanizeU
San Francisco, USA

Abstract: User behavior datasets, such as the Netflix dataset, are typically sparse and high dimensional. For this reason, recommender systems tend to be based on matrix factorization compression algorithms instead of traditional statistical models. Like2Vec proposes a novel approach that transforms a sparse dataset into a graph representation, followed by a neural embedding of the nodes into a rich, dense latent feature space. The distance between movies in this latent space can be used as a similarity metric. These vector projections allow the model to surface and recommend highly relevant movies for the respective users. We show that Like2Vec outperforms standard baselines on both the RMSE and Recall-at-N evaluation metrics. Different evaluation metrics lead to different hyper-parameter configurations. We argue that Recall-at-N is a superior metric for evaluating recommender systems, since it provides a better assessment of the quality of top-N recommendations.

Keywords: recommender system; neural networks; graphical model; neural embedding; log-likelihood ratio; latent space

I. INTRODUCTION

Statistical models are among the most popular techniques in machine learning for supervised predictive tasks. Logistic regression and linear regression are still among the most commonly used algorithms today. They have proven to be robust and powerful models across many domains.

However, there is a type of data on which such models regularly under-perform. When the data is high dimensional and sparse, machine learning models suffer from what is often referred to as the curse of dimensionality.
When dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.

Figure 1: Input graph on the left is embedded into continuous vector space on the right.

Moreover, distances between data points lose their significance, since all objects appear to be sparse and dissimilar in many ways, which hinders statistical models from learning meaningful patterns.

Unstructured text is a good example of data on which statistical models struggle. A text corpus can be represented as a document-term frequency matrix. Such a matrix typically has dimensions on the order of 10^5 to 10^6 and is highly sparse. Another example is user behavior data, where both the number of users and the number of items can be very large, but any given user will only ever interact with a tiny fraction of the item set. Amazon purchase history and Netflix viewing history are data of this type.

There are many ways to deal with such data, one of which is to compress it into a lower dimensional, rich feature space where the drawbacks of the curse of dimensionality are mitigated, allowing traditional statistical models to function adequately.

In text, a very popular compression model is Word2Vec, a neural network word embedding algorithm. It is a type of neural language model that has been used to capture the semantic and syntactic structure of human language [3], and even logical analogies [4].

In the context of user-item matrices, a common compression algorithm is matrix factorization. There exist many different types of matrix factorization algorithms, such as PCA, SVD and non-negative matrix factorization. These techniques decompose a high dimensional matrix into its lower dimensional components while maintaining the underlying signal in the data. This transformed space is sometimes referred to as the latent topic space. It contains rich, dense features that were latent (hidden) in the original feature space.

Like2Vec takes inspiration from these compression methods and combines them with a graphical representation to build a new kind of recommender system, evaluated on the well-known Netflix dataset.

II. RECOMMENDER ALGORITHMS

Figure 2: Matrix Factorization

A. COLLABORATIVE ALGORITHMS

Most recommender systems are based on collaborative filtering, where recommendations rely only on past user behavior. In the Netflix dataset, used in our evaluation, such information is in the form of rated user viewing history. There are two primary approaches to collaborative filtering: the neighborhood approach and the latent factor approach.

Neighborhood models represent the most common approach. They are based on the similarity among either users or items. For instance, two users are similar because they have given similar ratings to the same set of items. Similarity between items can be deduced with the same logic.

Latent factor models represent users and items as vectors in the same 'latent factor' space by means of a reduced number of hidden factors. In such a space, users and items are directly comparable: the rating of user u on item i is predicted by the proximity, for example the inner product, between the related latent factor vectors.

One of the unique characteristics of the Like2Vec model is that it incorporates elements of both approaches. The latent factors are encoded in the dimensions of the dense vector embedding, and the neighborhood information is a by-product of the SkipGram movie representation.

III. GRAPH EMBEDDING TECHNIQUE

The sparsity of a graph representation is both a strength and a weakness. Sparsity enables the design of efficient discrete algorithms, but can make it harder to generalize in statistical learning. Machine learning applications on graphs, such as movie recommendation here [6], must be able to deal with this sparsity.

Figure 3: DeepWalk

A. DEEPWALK

DeepWalk is an approach for learning latent representations of vertices in a graph. These latent representations extract meaningful structural information and encode it in a continuous vector space, which is then easily exploited by statistical models.

DeepWalk can be viewed as a generalization of the Word2Vec embedding representation. Word2Vec language models are composed of artificial neural networks stacked to form an auto-encoder. These models have proven particularly useful at compressing high dimensional, sparse representations of text into dense, low dimensional vectors. From just sequences of words in a corpus, the training generates unsupervised features through back-propagation in the neural network.

DeepWalk takes this a step further than Word2Vec and generates its own "corpus". The graph is explored by a series of truncated random walks. By treating the walks as the equivalent of sentences and the nodes as words, the algorithm is able to generate arbitrarily sized corpora. This synthetic language captures the community information present in the graph. Traditional neural language models can then be used to extract rich movie embeddings from the corpus.

DeepWalk's latent representations have been evaluated on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Results show that DeepWalk outperforms challenging baselines, especially in the presence of missing information. DeepWalk's representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data.
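The corpus-generation idea described above can be illustrated with a minimal sketch. It assumes the graph is stored as a dictionary of weighted neighbors and that the next step is sampled in proportion to edge weight; the function names, the toy graph and the parameter values are hypothetical and are not taken from the authors' implementation.

```python
import random


def random_walk(graph, start, walk_length, rng):
    """Return one truncated random walk as a list of node ids.

    graph maps each node to a dict {neighbor: edge_weight} (an assumed layout).
    """
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            # Dead end: the walk is truncated early.
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        # Sample the next node with probability proportional to edge weight.
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph, walks_per_node, walk_length, seed=0):
    """Start walks_per_node walks from every node; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for node in nodes:
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus


# Hypothetical toy graph: three connected movies and one isolated movie.
toy_graph = {
    "A": {"B": 3.0, "C": 1.0},
    "B": {"A": 3.0, "C": 2.0},
    "C": {"A": 1.0, "B": 2.0},
    "D": {},
}
corpus = build_corpus(toy_graph, walks_per_node=2, walk_length=5)
```

Each inner list in `corpus` plays the role of a sentence whose words are movie ids, so the result can be handed to any word-embedding trainer that accepts tokenized sentences.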
DeepWalk is also scalable. It is an online learning algorithm that builds useful incremental results and is trivially parallelizable. These qualities make it suitable for real-world tasks such as the Netflix dataset, which is a challenging, high dimensional, sparse dataset.

IV. LANGUAGE MODEL

Figure 4: SkipGram Language Model

The goal of language modeling is to estimate the likelihood of a specific sequence of words appearing in a corpus. More formally, given a sequence of words $W_1^n = (w_0, w_1, \ldots, w_n)$, where $w_i \in V$ ($V$ is the vocabulary), we would like to maximize

$$\Pr(w_n \mid w_0, w_1, \ldots, w_{n-1})$$

over all the training corpus.

DeepWalk generalizes such language modeling by exploring the graph through a stream of short random walks. These walks can be thought of as short sentences and phrases in a special language. The direct analog is to estimate the likelihood of observing vertex $v_i$ given all the previous vertices visited so far in the random walk.

A. SKIPGRAM

The language model used in Like2Vec is the SkipGram algorithm. It maximizes the co-occurrence probability among the words that appear within a window, w, in a sentence [7]. The SkipGram language model has the following characteristics:

• Instead of using the context to predict a missing word, it uses one word to predict the context.
• The context is composed of the words appearing to the right of the given word as well as to the left.
• It removes the ordering constraint on the problem.
• The model is required to maximize the probability of any word appearing in the context without knowledge of its offset from the given word.

By generating synthetic sentences from a stream of random walks and using that corpus as input to the SkipGram language model, we are able to build representations that capture the shared similarities in local graph structure between vertices. Vertices that have similar neighborhoods acquire similar representations, which allows generalization on machine learning tasks.

V. EVALUATION METRICS

The goal of the recommender system is to surface a list of items that are the most relevant or appealing to a specific user. This is referred to as a top-N recommendation task.

A common practice in industry and academia is to evaluate recommender systems through error metrics such as RMSE (root mean squared error), which captures the average error between the actual ratings and the ratings predicted by the model. However, such an evaluation metric is not a natural fit for a top-N recommendation task.

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N task, and that RMSE improvements often do not translate into accuracy improvements. Direct evaluation of top-N performance must be accomplished by means of alternative methodologies based on accuracy metrics, such as recall and precision.

Accordingly, this experiment evaluates the Like2Vec model with both the classical RMSE metric and the more appropriate method called Recall-at-N.

A. RECALL-AT-N

The Recall-at-N evaluation metric attempts to directly assess the quality of top-N recommendations in a way that RMSE cannot. The dataset with known ratings is first split into two subsets: training set M and test set T. The model is trained with the ratings in M and then evaluated on the test set T. A special characteristic of the test set is that it contains only 5-star ratings. The goal is to construct a test set that only contains highly relevant items for the respective users.
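One way to build such a split is sketched below. The text does not say how the held-out ratings are chosen, so this sketch assumes that a random fraction of the 5-star ratings goes into T and that every other rating stays in M; the function name, the tuple layout and the sample data are hypothetical.

```python
import random


def split_ratings(ratings, test_fraction=0.1, seed=0):
    """Split (user, movie, rating) triples into training set M and test set T.

    Assumption: T is a random sample of the 5-star ratings only;
    all remaining ratings (including unsampled 5-star ones) go to M.
    """
    rng = random.Random(seed)
    train, test = [], []
    for user, movie, rating in ratings:
        if rating == 5 and rng.random() < test_fraction:
            test.append((user, movie, rating))
        else:
            train.append((user, movie, rating))
    return train, test


# Hypothetical sample data.
sample_ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3), ("u1", "m3", 5),
    ("u2", "m1", 4), ("u2", "m3", 5), ("u2", "m4", 1),
    ("u3", "m2", 5), ("u3", "m4", 5), ("u3", "m5", 2),
]
M, T = split_ratings(sample_ratings, test_fraction=0.5, seed=1)
```

Because only 5-star ratings can enter T, every test case corresponds to a movie that the user is known to have found highly relevant.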
In order to measure recall and precision, we perform the following steps for each movie i rated 5 stars by user u in T:

i. From user u's viewing history, surface the movie that is most similar to movie i, based on the cosine similarity of the embedded vectors.
ii. Randomly select 1000 additional movies unrated by user u. The assumption is that most of these items will not be of interest to user u.
iii. Compute the same similarity score for the additional 1000 movies.
iv. Generate a ranked list by ordering all 1001 movies according to their similarity scores. Let p denote the rank of the test movie i within this list. The best result corresponds to the case where the test movie i precedes all the random items (i.e., p = 1).
v. Form a top-N recommendation list by picking the N top-ranked items from the list. If p ≤ N we have a hit (i.e., the test item i is recommended to the user). Otherwise we have a miss. The chance of a hit increases with N. When N = 1001 we always have a hit.

The computation of recall and precision proceeds as follows. The overall recall and precision are defined by averaging over all test cases:

$$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$

where |T| is the number of test ratings.

It is important to note that Recall-at-N underestimates the computed recall and precision with respect to the true recall and precision. It must be viewed as a lower bound on the recommender system's performance. This stems from the assumption that all 1000 random movies are non-relevant to user u.

VI. DATA SET

A. NETFLIX DATA SET

The Netflix data set contains the movie viewing behavior of 480,189 users. It is a record of 100,480,507 ratings on a scale of 1 to 5, distributed across 17,770 movies. It is used to construct models that predict user ratings based on previous ratings, without any other information about the users or movies. Predictions have traditionally been scored against the true ratings in terms of root mean squared error (RMSE), and the goal is to reduce this error as much as possible.

VII. IMPLEMENTATION

A. DATA PREPROCESSING & GRAPH BUILDING

A graph can be thought of as a representation of a sparse square matrix, where each dimension corresponds to a node and the non-zero entries are the weights of the edges connecting the nodes. The Netflix dataset is not in a graph format and needs to be transformed appropriately.

In this paper, we explore two ways to build a graph out of the user-item matrix, where each non-zero entry is a rating between 1 and 5.

Covariance matrix: the first method takes the user-item matrix and multiplies it by its transpose. The result is a symmetric square matrix whose non-zero entries represent a measure of co-occurrence. This matrix can then be represented as an undirected graph.

Depending on the order of the matrix multiplication, the nodes will represent either movies or users. In this exploration phase of Like2Vec, only graphs with movies as nodes were studied. Embedding movies into dense vectors reduced the model's dimensionality by several orders of magnitude compared to doing the same with the users.

Log-likelihood matrix: in statistics, a log-likelihood ratio test is a statistical test used to compare the goodness of fit of two models, one of which is a special case of the other. The test is based on the likelihood ratio, which expresses how many times more likely the data are under one model than the other.

We used this idea to compute a score that analyzes counts of events that occur together. The log-likelihood ratio estimates how much more likely two items are to co-occur than not. Its main advantage is that it corrects for unbalanced occurrences of items. It is a co-occurrence metric that is weighted by the global occurrence of an item. In this way, obscure movies are not drowned out by globally popular items.

Another advantage is that it does not take the ratings into account. Ratings tend to be a noisy metric because they are highly subjective.
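To make the log-likelihood matrix more concrete, here is a minimal sketch of one common formulation: the G² statistic computed on a 2x2 table of co-occurrence counts for each pair of movies. The text does not give the exact formula used, so this choice is an assumption, and the function names and the toy viewing data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: users who saw both movies   k12: users who saw only the first
    k21: users who saw only the second   k22: users who saw neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def llr_edge_weights(user_movies):
    """Return {(movie_a, movie_b): llr} for every pair seen together at least once.

    user_movies maps each user to the set of movies they have seen;
    ratings are ignored, only the viewing events are counted.
    """
    n_users = len(user_movies)
    seen = Counter()
    together = Counter()
    for movies in user_movies.values():
        seen.update(movies)
        together.update(combinations(sorted(movies), 2))
    weights = {}
    for (a, b), k11 in together.items():
        k12 = seen[a] - k11
        k21 = seen[b] - k11
        k22 = n_users - seen[a] - seen[b] + k11
        weights[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return weights


# Hypothetical toy data: movies A and B are always watched together.
toy_views = {"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"C"}, "u4": {"C"}}
edge_weights = llr_edge_weights(toy_views)
```

The resulting dictionary can serve as the weighted edge list of an undirected movie graph. Note that G² grows with any departure from independence, so a practical pipeline may also need to check that a pair co-occurs more often than expected rather than less.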
The log-likelihood approach can also be used in other domains where ratings are not available, for example a user's product purchase history.

B. RANDOM WALKS

Random walks are generated by picking a graph node at random and traversing the network for 40 steps. The probability of each next step is proportional to the weight of the edges connected to the current node. The sequence travelled is recorded and added to the "corpus". This same process is repeated multiple times for every node in order to fully explore the graph structure.

C. MOVIE EMBEDDING

Once the corpus of random walks is generated, it is passed into the SkipGram language model. With an average sliding window size of 6 items, the movie vectors are trained multiple times with different embedding sizes.

D. EVALUATION

The evaluation is performed on two metrics: Recall-at-N, the preferred metric for recommender systems, and the traditional RMSE, in order to compare the results with a collaborative filtering baseline.

For the calculation of RMSE, the ratings first need to be predicted. For every movie in the test set, we pick the top-k movies seen by the respective user that are most similar to the movie being tested. The rating is calculated as either a naïve average or a weighted average (based on similarity scores) of the ratings of the top-k movies.

VIII. RESULTS

Figure 5: Recall-at-N (%)

A. RECALL-AT-N

Like2Vec's Recall-at-N score was evaluated at different values of N, where N is the number of top-N recommendations needed to surface the test-set movie in question. Such an analysis gives an idea of the range of performance the recommender system is capable of. The full behavior of the model is visualized, which allows for a better comparison with other systems. In a commercial setting, an in-depth understanding of the model permits optimal tuning of the algorithm.

Figure 6: Recall-at-N (Hit Freq.)

Figure 5 plots the recall score for both Like2Vec and a baseline recommender. The baseline is essentially the same model, but with the DeepWalk featurization removed. Its recall score is computed directly from the log-likelihood ratio as the similarity metric.

Like2Vec clearly outperforms the baseline for small N. This is favorable behavior for a recommender system. It means that Like2Vec is able to retrieve highly relevant five-star movies very early in its ranked list. There is a cross-over around rank 7, but this is acceptable given that most users rarely look at recommendations past rank 5 in most commercial domains.

Figure 6 plots the same results, but with recall shown as a frequency count instead of a normalized percentage. This makes it even clearer what sets Like2Vec apart. If the recommender system could suggest only one movie, it would outperform the baseline by more than a factor of two.
Like2Vec is optimized to prioritize early suggestion of great movies at the expense of long-tail performance. This is highlighted here by the fact that the curve is close to the y-axis.

Figure 7: Grid-Search on Recall Score

Hyper-parameter tuning through cross-validation was performed on the embedding size and the number of random walks generated per node. Figure 7 plots the grid search results for the cross-over recall score. The cross-over recall score is defined as the number of top-N recommendations possible before Like2Vec stops outperforming the log-likelihood baseline.

Vector embeddings of both 300 and 500 dimensions gave the best cross-over recall score, although the larger embedding was able to achieve this score with fewer random walks per node.

B. RMSE

Figure 8: Baseline RMSE Score
The performance of Like2Vec was also evaluated with RMSE on the predicted ratings. Predictions were made with both naïve and weighted averaging. Figure 8 plots the results for the baseline and Figure 9 for Like2Vec.

Figure 9: Like2Vec RMSE Score

Like2Vec outperforms the baseline in both prediction schemes. Although Like2Vec is optimized for recall performance, it can still produce strong predictions in an RMSE setting.

The baseline suffered a significant performance drop when not using a weighted average for prediction. Like2Vec, on the other hand, performed almost identically in both situations. This behavior highlights the quality of the movie embeddings. A projection of movie vectors along all the embedded dimensions is enough to give a rich and accurate similarity metric between the movies, allowing for robust rating prediction.

Figure 10: Grid-Search on RMSE Score

Hyper-parameter tuning through cross-validation was also performed for the RMSE evaluation metric. Figure 10 plots the grid search results. Note that the optimal performance of the model is achieved at very different hyper-parameter configurations for the two metrics. This should not come as a surprise since, as discussed earlier, RMSE and Recall-at-N do not test the same behavior. Tuning a recommender system with the wrong metric will therefore produce a sub-optimal model.

C. COLLABORATIVE FILTERING

Another grid search was performed for a second, more challenging baseline. ALS collaborative filtering is the go-to model for most recommender systems. Here again, Like2Vec outperforms ALS for all hyper-parameter configurations tested.

IX. CONCLUSION

In this paper, we introduce Like2Vec, a novel approach to recommender systems. We combine multiple machine learning techniques to transform a high dimensional dataset into a social graph. Community information is extracted from the graph by borrowing ideas from neural language models, producing dense latent representations. These latent representations encode rich features in a lower dimensional continuous vector space.

Like2Vec achieved high rating-prediction performance as well as promising Recall-at-N results. It was shown that Recall-at-N can force the model to prioritize surfacing highly relevant content early in the ranking. Such behavior would be penalized in an RMSE evaluation setting, but with Recall-at-N it is instead promoted at the expense of long-tail performance.

References

[1] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 475–486. IEEE, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. 2013.
[3] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[4] L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nîmes, France, 1991. EC2.
[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, 2012.
[8] R. Bambini, P. Cremonesi, and R. Turrin. Recommender Systems Handbook, chapter A Recommender System for an IPTV Service Provider: a Real Large-Scale Production Environment. Springer, 2010.
[9] P. Cremonesi, E. Lentini, M. Matteucci, and R. Turrin. An evaluation methodology for recommender systems. 4th Int. Conf. on Automated Solutions for Cross Media Content and Multi-channel Distribution, pages 224–231, Nov 2008.
[10] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. 2013.