RecSys Challenge 2014, SemWexMFF group

Hybrid Biased k-NN to Predict Movie Tweets Popularity
Ladislav Peška
Department of Software Engineering
Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic
Peter Vojtáš
Department of Software Engineering
Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic
ABSTRACT
In this paper we describe the approach of our SemWexMFF group to
the RecSys Challenge 2014. The target of the challenge was to predict
the level of user engagement on tweets generated automatically from
IMDB.
During the experiments we tested several state-of-the-art
prediction techniques and proposed a variant of the item-based k-NN
algorithm that better reflects user engagement and the nature of
the content-based attributes in the movie domain.
Our final solution (placed in the midfield of the challenge
leaderboard) is an aggregation of several runs of this algorithm.
OUR APPROACH
Our approach follows two main hypotheses:
1. Engagement of similar objects should be similar.
2. Engagement depends on the neighborhood (friends and
followers) of the current user.
Similarity of objects is based on their content-based attributes:
Numeric attributes: normalized distance
$sim_{x,y,maxDist} = \max\left(0, \frac{maxDist - |x - y|}{maxDist}\right)$
String attributes: relative Levenshtein distance
$sim_{x,y} = 1 - \frac{levenshtein(x, y)}{\max(length(x), length(y))}$
Nominal/Set attributes: Jaccard similarity
$sim_{\mathbf{x},\mathbf{y}} = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}$
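The three similarity measures can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the handling of empty strings and empty sets, and the `max_dist` value in the usage note are assumptions.

```python
def numeric_similarity(x: float, y: float, max_dist: float) -> float:
    """Normalized distance: max(0, (max_dist - |x - y|) / max_dist)."""
    return max(0.0, (max_dist - abs(x - y)) / max_dist)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(x: str, y: str) -> float:
    """Relative Levenshtein similarity: 1 - distance / longer length."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(x, y) / longest


def jaccard_similarity(x: set, y: set) -> float:
    """Jaccard similarity: |x & y| / |x | y|."""
    union = x | y
    if not union:
        return 0.0  # assumption: two empty sets count as dissimilar
    return len(x & y) / len(union)
```

For instance, with a hypothetical `max_dist` of 10 years, `numeric_similarity(1995, 2000, 10)` gives 0.5; differences of `max_dist` or more are clipped to 0.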
The effect of a user's friends and followers was approximated by a user
bias.
Movie attributes were queried from the OMDB API:
- average rating, number of awards, IMDB metascore
- number of ratings
- movie name, release date, genre, country, language, director, actors
RESULTS
HYBRID BIASED k-NN
For a tweet tID, its movie mID and a fixed k, the algorithm first
computes similarities to other movies and selects the k most
similar movies. Then, for each tweet about one of these movies, the
predicted ranking r̂ is increased according to the similarity s,
the user engagement r and the bias of the tweeting user. The bias
of the current movie is also added to the final prediction r̂.
function HybridBiasedKNN(tID, mID, k){
  r̂ = 0;
  /* compute similarity of mID to all movies mID' in the training set */
  foreach(mID' ϵ TrainSet){
    S[mID'] = similarity(mID, mID');
  }
  S = getKMostSimilar(S, k);
  /* get all tweets about movies in S */
  foreach({uID, mID', r, s}:
      {uID, mID', r} ϵ TrainSet && S[mID'] = s){
    r̂ += s * r / (bias(uID) + ε);
  }
  /* normalize by the summed similarities and add the bias of the current movie */
  r̂ = bias(mID) + (r̂ / sum(s));
  return r̂;
}
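The pseudocode above can be sketched in runnable Python as follows. This is a sketch under stated assumptions, not the authors' code: the training set is taken to be a list of `(uID, mID, r)` tuples, the `similarity`, `user_bias` and `movie_bias` callables are hypothetical stand-ins for functions the poster does not define, `sum(s)` is read as the sum of similarities over the tweets used, and the fallback for an empty neighborhood is an assumption.

```python
from typing import Callable, Hashable, Iterable, Tuple

Tweet = Tuple[Hashable, Hashable, float]  # (uID, mID, engagement r)


def hybrid_biased_knn(
    target_movie: Hashable,
    k: int,
    train_set: Iterable[Tweet],
    similarity: Callable[[Hashable, Hashable], float],
    user_bias: Callable[[Hashable], float],
    movie_bias: Callable[[Hashable], float],
    eps: float = 1e-6,
) -> float:
    tweets = list(train_set)
    # Similarity of the target movie to every movie in the training set.
    sims = {m: similarity(target_movie, m) for _, m, _ in tweets}
    # Keep only the k most similar movies.
    ranked = sorted(sims.items(), key=lambda item: item[1], reverse=True)
    top_k = dict(ranked[:k])

    weighted_sum = 0.0
    sim_sum = 0.0
    for u, m, r in tweets:
        if m in top_k:
            s = top_k[m]
            # Engagement weighted by similarity and scaled by the user bias.
            weighted_sum += s * r / (user_bias(u) + eps)
            sim_sum += s

    if sim_sum == 0.0:
        # Assumption: with no usable neighbors, predict the movie bias alone.
        return movie_bias(target_movie)
    return movie_bias(target_movie) + weighted_sum / sim_sum
```

As in the pseudocode, the target movie is not excluded from the candidate set; whether it should be depends on how the training and test tweets are split.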
[Figure: example with the movies Braveheart and The Patriot]

Attribute        Braveheart                     The Patriot
Rating           8.4                            7.1
Year             1995                           2000
IMDB metascore   68                             63
Genre            Action, Biography, Drama       Action, Drama, War
Actors           Mel Gibson; James Robinson;    Mel Gibson; Heath Ledger;
                 Sean Lawlor; Sandy Nelson;     Joely Richardson; Jason Isaacs
                 James Cosmo

Tweets shown in the figure:
TID: 421065455743541248, UID: 25813709
TID: 410808483345465344, UID: 307867510, Engagement: 0
TID: 421040870931320833, UID: 296041028, Engagement: 3
…
AVG Eng. values shown in the figure: 0.0, 0.3250, 0.024, 0.001
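As a worked illustration of the Jaccard similarity on the Braveheart and The Patriot example, assume that the genre and actor lists shown for the two movies are the complete sets being compared (an assumption; the attribute values actually used in the experiments may differ). The two movies share the genres Action and Drama, and the single actor Mel Gibson:

```latex
\[
sim_{genre} = \frac{|\{Action, Drama\}|}{|\{Action, Biography, Drama, War\}|} = \frac{2}{4} = 0.5
\]
\[
sim_{actors} = \frac{|\{Mel\ Gibson\}|}{5 + 4 - 1} = \frac{1}{8} = 0.125
\]
```

The numeric attributes (rating, year, metascore) would additionally require a value of maxDist, which is not stated here.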
Results of state-of-the-art methods
Method                             nDCG
Random predictions                 0.7482
Bi-Polar Slope One                 0.7652
Factor Wise Matrix Factorization   0.7556
Item-Item k-NN                     0.7604
Decision Tree                      0.7494
Support Vector Machines (SVM)      0.8057
Results of Hybrid k-NN using only one attribute
Method              nDCG     Method      nDCG
AVG rating          0.7918   Genres      0.7919
Awards              0.7652   Countries   0.7984
IMDB Metascore      0.8057   Languages   0.8005
Number of ratings   0.7964   Director    0.8029
Movie name          0.7947   Actors      0.7930
Release year        0.7962
Results of Hybrid k-NN combining more attributes
Method                                                                                             nDCG
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings)                           0.7927
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings), no bias                  0.7792
Linear Regression (Metascore, Language, Director, Country, Date, # of ratings)                     0.7913
AVG (Metascore, Language, Director, Country, Date, # of ratings), omit best and worst prediction   0.8134
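The "AVG ... omit best and worst prediction" aggregation can be read as a trimmed mean over the per-attribute predictions for a single tweet. The sketch below follows that reading; interpreting "best and worst" as the highest and lowest predicted values is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def trimmed_average(predictions: Sequence[float]) -> float:
    """Average the predictions after dropping the lowest and the highest one."""
    if len(predictions) < 3:
        raise ValueError("need at least three predictions to drop two of them")
    kept = sorted(predictions)[1:-1]  # drop one minimum and one maximum
    return sum(kept) / len(kept)
```

With six single-attribute runs (Metascore, Language, Director, Country, Date, # of ratings), four predictions remain in the average for each tweet.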
LESSONS LEARNED and POSSIBLE EXTENSIONS
Both hypotheses on which we based our solution seem to be confirmed. Omitting the user bias led to a severe decrease in the algorithm's success
metrics, and almost all content-based attributes proved to be quite good measures of movie similarity.
Hybrid biased k-NN outperformed all considered state-of-the-art machine learning methods; however, our results placed in the lower
midfield of the challenge. Several extensions to the current approach are possible, namely:
- Considering temporal dependence in the dataset
- Using some of the tweet characteristics, additional content-based attributes (e.g. from DBpedia), or some other meta-learning methods
