RecSys Challenge 2014, SemWexMFF group


RecSys Challenge 2014, SemWexMFF group,
Ladislav Peska and Peter Vojtas

Published in: Science

Hybrid Biased k-NN to Predict Movie Tweets Popularity

Ladislav Peška
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

Peter Vojtáš
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

ABSTRACT
In this paper we describe the approach of our SemWexMFF group to the RecSys Challenge 2014. The target of the challenge was to predict the level of user engagement on tweets generated automatically from IMDB. During our experiments we tested several state-of-the-art prediction techniques and proposed a variant of the item-based k-NN algorithm that better reflects user engagement and the nature of the content-based attributes in the movie domain. Our final solution (placed in the midfield of the challenge leaderboard) is an aggregation of several runs of this algorithm.

OUR APPROACH
Our approach follows two main hypotheses:
1. Engagement of similar objects should be similar
2. Engagement depends on the neighborhood (friends and followers) of the current user

Similarity of objects is based on their content-based attributes:
- Numeric attributes: normalized distance
  $sim_{x,y,maxDist} = \max\left(0, \frac{maxDist - |x - y|}{maxDist}\right)$
- String attributes: relative Levenshtein distance
  $sim_{x,y} = 1 - \frac{levenshtein(x, y)}{\max(length(x), length(y))}$
- Nominal/Set attributes: Jaccard similarity
  $sim_{\mathbf{x},\mathbf{y}} = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}$

The effect of a user's friends and followers was approximated by a user bias.

Movie attributes were queried from the OMDB API:
- average rating, number of awards, IMDB metascore
- number of ratings
- movie name, release date, genre, country, language, director, actors

HYBRID BIASED k-NN
For a tweet tID, its movie mID and a fixed k, the algorithm first computes similarities to other movies and selects the k most similar movies. Then, for each tweet about one of these movies, the predicted ranking $\hat{r}$ is increased according to the similarity $s$, the user engagement $r$ and the bias of the tweeting user.
The bias of the current movie is also added to the final prediction $\hat{r}$ (written r_hat in the pseudocode; mID' denotes a candidate movie from the training set).

    function HybridBiasedKNN(tID, mID, k){
        r_hat = 0;
        /* compute similarity for all movies */
        foreach(mID' ϵ TrainSet){
            S[mID'] = similarity(mID, mID');
        }
        S = getKMostSimilar(S, k);
        /* get all tweets about movies in S */
        foreach({uID, mID', r, s}: {uID, mID', r} ϵ TrainSet && S[mID'] = s){
            r_hat += s * r / (bias(uID) + ε);
        }
        r_hat = bias(mID) + (r_hat / sum(s));
        return r_hat;
    }

[Figure: example comparison of Braveheart (TID: 421065455743541248, UID: 25813709) with The Patriot]

Attribute        Braveheart                                                             The Patriot
Rating           8.4                                                                    7.1
Year             1995                                                                   2000
IMDB metascore   68                                                                     63
Genre            Action, Biography, Drama                                               Action, Drama, War
Actors           Mel Gibson; James Robinson; Sean Lawlor; Sandy Nelson; James Cosmo     Mel Gibson; Heath Ledger; Joely Richardson; Jason Isaacs

Tweets listed in the figure:
TID: 410808483345465344   UID: 307867510   Engagement: 0
TID: 421040870931320833   UID: 296041028   Engagement: 3
…
Average engagement values shown in the figure: 0.0, 0.3250, 0.024, 0.001

RESULTS

Results of state-of-the-art methods
Method                              nDCG
Random predictions                  0.7482
Bi-Polar Slope One                  0.7652
Factor Wise Matrix Factorization    0.7556
Item-Item k-NN                      0.7604
Decision Tree                       0.7494
Support Vector Machines (SVM)       0.8057

Results of Hybrid k-NN using only one attribute
Method               nDCG
AVG rating           0.7918
Awards               0.7652
IMDB Metascore       0.8057
Number of ratings    0.7964
Movie name           0.7947
Release year         0.7962
Genres               0.7919
Countries            0.7984
Languages            0.8005
Director             0.8029
Actors               0.7930

Results of Hybrid k-NN combining more attributes
Method                                                                                               nDCG
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings)                             0.7927
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings), no bias                    0.7792
Linear Regression (Metascore, Language, Director, Country, Date, # of ratings)                       0.7913
AVG (Metascore, Language, Director, Country, Date, # of ratings), omit best and worst prediction     0.8134

LESSONS LEARNED and POSSIBLE EXTENSIONS
Both hypotheses on which we based our solution seem to be confirmed.
Omitting the user bias led to a clear decrease in the algorithm's success metric (nDCG dropped from 0.7927 to 0.7792), and almost all content-based attributes proved to be quite good measures of movie similarity. Hybrid biased k-NN outperformed all considered state-of-the-art machine learning methods; however, our results placed in the lower midfield of the challenge.

Several extensions to the current approach are possible, namely:
- Considering temporal dependence in the dataset
- Using some of the tweet characteristics, additional content-based attributes (e.g. from DBPedia) or some other meta-learning methods
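
ILLUSTRATIVE SKETCH
The three similarity measures and the HybridBiasedKNN pseudocode described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the challenge. The function names, the (uID, mID, engagement) tuple layout of the training set, the dictionary-based biases with their default values, and the handling of empty strings and sets are all assumptions made for illustration. The poster does not say how the biases are computed, so they are passed in as precomputed values. Reading sum(s) as the sum of similarities over all visited tweets is also an assumption.

```python
def numeric_sim(x, y, max_dist):
    """Normalized distance: max(0, (max_dist - |x - y|) / max_dist)."""
    return max(0.0, (max_dist - abs(x - y)) / max_dist)


def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def string_sim(x, y):
    """Relative Levenshtein similarity: 1 - lev(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(x, y) / longest


def jaccard_sim(x, y):
    """Jaccard similarity of two collections treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)


def hybrid_biased_knn(m_id, k, train, similarity, user_bias, movie_bias,
                      eps=1e-6):
    """Predict engagement for a tweet about movie m_id.

    train      -- list of (u_id, movie_id, engagement) tuples (assumed layout)
    similarity -- callable (movie_a, movie_b) -> float
    user_bias  -- dict u_id -> bias (assumed precomputed; default 1.0)
    movie_bias -- dict movie_id -> bias (assumed precomputed; default 0.0)
    """
    # Similarity of the target movie to every other movie in the training set.
    candidates = {mid for _, mid, _ in train if mid != m_id}
    sims = {mid: similarity(m_id, mid) for mid in candidates}

    # Keep only the k most similar movies.
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:k])

    # Accumulate similarity-weighted, bias-scaled engagement over all
    # tweets about the selected movies.
    weighted = 0.0
    sim_sum = 0.0
    for u_id, mid, r in train:
        s = top.get(mid)
        if s is None:
            continue
        weighted += s * r / (user_bias.get(u_id, 1.0) + eps)
        sim_sum += s

    base = movie_bias.get(m_id, 0.0)
    if sim_sum == 0:
        return base  # no usable neighbors: fall back to the movie bias
    return base + weighted / sim_sum
```

For the example movies above, jaccard_sim applied to the genre sets {Action, Biography, Drama} and {Action, Drama, War} gives 2/4 = 0.5. A combined similarity over several attributes could be passed to hybrid_biased_knn as the similarity argument; how the attributes were weighted in the original runs is not stated on the poster.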