Improving aggregate recommendation diversity using ranking based techniques.bak
896 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques Gediminas Adomavicius, Member, IEEE, and YoungOk Kwon Abstract—Recommender systems are becoming increasingly important to individual users and businesses for providing personalized recommendations. However, while the majority of algorithms proposed in recommender systems literature have focused on improving recommendation accuracy (as exemplified by the recent Netflix Prize competition), other important aspects of recommendation quality, such as the diversity of recommendations, have often been overlooked. In this paper, we introduce and explore a number of item ranking techniques that can generate substantially more diverse recommendations across all users while maintaining comparable levels of recommendation accuracy. Comprehensive empirical evaluation consistently shows the diversity gains of the proposed techniques using several real-world rating data sets and different rating prediction algorithms. Index Terms—Recommender systems, recommendation diversity, ranking functions, performance evaluation metrics, collaborative filtering. Ç1 INTRODUCTIONI N the current age of information overload, it is becoming recommendations result in more opportunities for users to increasingly harder to find relevant content. This problem get recommended such items. With this motivation, someis not only widespread but also alarming . Over the last studies proposed new recommendation methods that can10-15 years, recommender systems technologies have been increase the diversity of recommendation sets for a givenintroduced to help people deal with these vast amounts of individual user, often measured by an average dissimilarityinformation , , , , , , and they have been between all pairs of recommended items, while maintainingwidely used in research as well as e-commerce applications, an acceptable level of accuracy , , , , .such as the ones used by Amazon and Netflix. http://ieeexploreprojects.blogspot.com recommendation diversity from an These studies measure The most common formulation of the recommendation individual user’s perspective (i.e., individual diversity).problem relies on the notion of ratings, i.e., recommender In contrast to individual diversity, which has beensystems estimate ratings of items (or products) that are yet explored in a number of papers, some recent studies ,to be consumed by users, based on the ratings of items  started examining the impact of recommender systems on sales diversity by considering aggregate diversity ofalready consumed. Recommender systems typically try to recommendations across all users. Note that high indivi-predict the ratings of unknown items for each user, often dual diversity of recommendations does not necessarilyusing other users’ ratings, and recommend top N items imply high aggregate diversity. For example, if the systemwith the highest predicted ratings. Accordingly, there have recommends to all users the same five best selling items thatbeen many studies on developing new algorithms that can are not similar to each other, the recommendation list forimprove the predictive accuracy of recommendations. each user is diverse (i.e., high individual diversity), but onlyHowever, the quality of recommendations can be evaluated five distinct items are recommended to all users andalong a number of dimensions, and relying on the accuracy purchased by them (i.e., resulting in low aggregate diversityof recommendations alone may not be enough to find the or high sales concentration).most relevant items for each user , . In particular, the While the benefits of recommender systems that provideimportance of diverse recommendations has been previously higher aggregate diversity would be apparent to manyemphasized in several studies , , , , , , users (because such systems focus on providing wider. These studies argue that one of the goals of range of items in their recommendations and not mostlyrecommender systems is to provide a user with highly bestsellers, which users are often capable of discovering byidiosyncratic or personalized items, and more diverse themselves), such systems could be beneficial for some business models as well , , , . For example, it would be profitable to Netflix if the recommender systems. G. Adomavicius is with the Department of Information and Decision can encourage users to rent “long-tail” type of movies (i.e., Sciences, Carlson School of Management, University of Minnesota, Minneapolis, MN 55455. E-mail: email@example.com. more obscure items that are located in the tail of the sales. Y. Kwon is with the Division of Business Administration, Sookmyung distribution ) because they are less costly to license and Women’s University, 53-12 Chungpa-Dong 2-Ka, Yongsan-Ku, Seoul, acquire from distributors than new-release or highly 140-742, Korea. E-mail: firstname.lastname@example.org. popular movies of big studios . However, the impactManuscript received 25 Aug. 2009; revised 29 Apr. 2010; accepted 25 Sept. of recommender systems on aggregate diversity in real-2010; published online 21 Dec. 2011. world e-commerce applications has not been well under-Recommended for acceptance by D. Cook.For information on obtaining reprints of this article, please send e-mail to: stood. For example, one study , using data from email@example.com, and reference IEEECS Log Number TKDE-2009-08-0630. clothing retailer, confirms the “long-tail” phenomenon thatDigital Object Identifier no. 10.1109/TKDE.2011.15. refers to the increase in the tail of the sales distribution (i.e., 1041-4347/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 897the increase in aggregate diversity) attributable to the usage TABLE 1of the recommender system. On the other hand, another Accuracy-Diversity Tradeoff: Empirical Examplestudy  shows a contradictory finding that recommendersystems actually can reduce the aggregate diversity in sales.This can be explained by the fact that the idiosyncratic itemsoften have limited historical data and, thus, are moredifficult to recommend to users; in contrast, popular itemstypically have more ratings and, therefore, can be recom-mended to more users. For example, in the context ofNetflix Prize competition , , there is some evidence Note: Recommendations (top-1 item for each user) are generated forthat, since recommender systems seek to find the common 2,828 users among the items that are predicted above the acceptableitems (among thousands of possible movies) that two users threshold 3.5 (out of 5), using a standard item-based collaborativehave watched, these systems inherently tend to avoid filtering technique with 50 neighbors on MovieLens Data set.extremes and recommend very relevant but safe recom-mendations to users . approximately 2,000 available distinct items are recom- As seen from this recent debate, there is a growing mended across all users. The system can improve theawareness of the importance of aggregate diversity in diversity of recommendations from 49 up to 695 (a 14-foldrecommender systems. Furthermore, while, as mentioned increase) by recommending the long-tail item to each userearlier, there has been significant amount of work done on (i.e., the least popular item among highly predicted items forimproving individual diversity, the issue of aggregate each user) instead of the popular item. However, highdiversity in recommender systems has been largely un- diversity in this case is obtained at the significant expense oftouched. Therefore, in this paper, we focus on developing accuracy, i.e., drop from 82 to 68 percent.algorithmic techniques for improving aggregate diversity of The above example shows that it is possible to obtainrecommendations (which we will simply refer to as diversity higher diversity simply by recommending less popularthroughout the paper, unless explicitly specified otherwise), items; however, the loss of recommendation accuracy inwhich can be intuitively measured by the number of this case can be substantial. In this paper, we explore newdistinct items recommended across all users. recommendation approaches that can increase the diversity Higher diversity (both individual and aggregate), how- of recommendations with only a minimal (negligible)ever, can come at the expense of accuracy. As known well, accuracy loss using different recommendation rankingthere is a tradeoff between accuracy and diversity because techniques. In particular, traditional recommender systemshigh accuracy may often be obtained by safely recommend- typically rank the relevant items in a descending order http://ieeexploreprojects.blogspot.coming to users the most popular items, which can clearly lead of their predicted ratings for each user and then recommendto the reduction in diversity, i.e., less personalized top N items, resulting in high accuracy. In contrast, therecommendations , , . And conversely, higher proposed approaches consider additional factors, such asdiversity can be achieved by trying to uncover and item popularity, when ranking the recommendation list torecommend highly idiosyncratic or personalized items for substantially increase recommendation diversity whileeach user, which often have less data and are inherently maintaining comparable levels of accuracy. This papermore difficult to predict, and, thus, may lead to a decrease provides a comprehensive empirical evaluation of thein recommendation accuracy. proposed approaches, where they are tested with various Table 1 illustrates an example of accuracy and diversity data sets in a variety of different settings. For example, thetradeoff in two extreme cases where only popular items or best results show up to 20-25 percent diversity gain withlong-tail type items are recommended to users, using only 0.1 percent accuracy loss, up to 60-80 percent gain withMovieLens rating data set (data sets used in this paper are 1 percent accuracy loss, and even substantially higherdiscussed in Section 5.1). In this example, we used a popular diversity improvements (e.g., up to 250 percent) if somerecommendation technique, i.e., neighborhood-based colla- users are willing to tolerate higher accuracy loss.borative filtering (CF) technique , to predict unknown In addition to providing significant diversity gains, theratings. Then, as candidate recommendations for each user, proposed ranking techniques have several other advanta-we considered only the items that were predicted above the geous characteristics. In particular, these techniques arepredefined rating threshold to assure the acceptable level of extremely efficient, because they are based on scalableaccuracy, as is typically done in recommender systems. sorting-based heuristics that make decisions based only onAmong these candidate items for each user, we identified the the “local” data (i.e., only on the candidate items of eachitem that was rated by most users (i.e., the item with the individual user) without having to keep track of the “global”largest number of known ratings) as a popular item, and the information, such as which items have been recommendeditem that was rated by least number of users (i.e., the item across all users and how many times. The techniques are alsowith the smallest number of known ratings) as a long-tail parameterizable, since the user has the control to choose theitem. As illustrated by Table 1, if the system recommends acceptable level of accuracy for which the diversity will beeach user the most popular item (among the ones that had a maximized. Also, the proposed ranking techniques provide asufficiently high predicted rating), it is much more likely for flexible solution to improving recommendation diversitymany users to get the same recommendation (e.g., the best because: they are applied after the unknown item ratingsselling item). The accuracy measured by precision-in-top-1 have been estimated and, thus, can achieve diversity gains inmetric (i.e., the percentage of truly “high” ratings among conjunction with a number of different rating predictionthose that were predicted to be “high” by the recommender techniques, as illustrated in the paper; as mentioned above,system) is 82 percent, but only 49 popular items out of the vast majority of current recommender systems already
898 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012employ some ranking approach, thus, the proposed techni- approaches provide a flexible solution, as mentionedques would not introduce new types of procedures into earlier: they do not introduce any new procedures intorecommender systems (they would replace existing ranking the recommendation process and also can be used inprocedures); the proposed ranking approaches do not conjunction with any available rating estimation algorithm.require any additional information about users (e.g., demo- In our experiments, to illustrate the broad applicability ofgraphics) or items (e.g., content features) aside from the the proposed recommendation ranking approaches, weratings data, which makes them applicable in a wide variety used them in conjunction with the most popular and widelyof recommendation contexts. employed CF techniques for rating prediction: a heuristic The remainder of the paper is organized as follows: neighborhood-based technique and a model-based matrixSection 2 reviews relevant literature on traditional recom- factorization technique.mendation algorithms and the evaluation of recommenda- Before we provide an overview of each technique, wetion quality. Section 3 describes our motivations for introduce some notation and terminology related toalternative recommendation ranking techniques, such as recommendation problem. Let U be the set of users of aitem popularity. We then propose several additional recommender system, and let I be the set of all possibleranking techniques in Section 4, and the main empirical items that can be recommended to users. Then, the utilityresults follow in Section 5. Additional experiments are function that represents the preference of item i 2 I by userconducted to further explore the proposed ranking techni- u 2 U is often defined as R : U Â I ! Rating, where Ratingques in Section 6. Lastly, Section 7 concludes the paper by typically represents some numeric scale used by the users tosummarizing the contributions and future directions. evaluate each item. Also, in order to distinguish between the actual ratings and the predictions of the recommender system, we use the Rðu; iÞ notation to represent a known2 RELATED WORK rating (i.e., the actual rating that user u gave to item i), and2.1 Recommendation Techniques for Rating the RÃ ðu; iÞ notation to represent an unknown rating (i.e., Prediction the system-predicted rating for item i that user u has notRecommender systems are usually classified into three rated before).categories based on their approach to recommendation:content-based, collaborative, and hybrid approaches , . 2.1.1 Neighborhood-Based CF TechniqueContent-based recommender systems recommend items There exist multiple variations of neighborhood-based CFsimilar to the ones the user preferred in the past. techniques , , . In this paper, to estimate RÃ (u, i),Collaborative filtering recommender systems recommend i.e., the rating that user u would give to item i, we firstitems that users with similar preferences (i.e., “neighbors”) compute the similarity between user u and other users u0 http://ieeexploreprojects.blogspot.comhave liked in the past. Finally, hybrid approaches can using a cosine similarity metric , :combine content-based and collaborative methods in P 0several different ways. Recommender systems can also be 0 i2Iðu;u0 Þ Rðu; iÞ Á Rðu ; iÞclassified based on the nature of their algorithmic technique simðu; u Þ ¼ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃqP P ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ; ð1Þ 2 0 2into heuristic (or memory-based) and model-based ap- i2Iðu;u0 Þ Rðu; iÞ i2Iðu;u0 Þ Rðu ; iÞproaches , . Heuristic techniques typically calculate where Iðu; u0 Þ represents the set of all items rated by bothrecommendations based directly on the previous user 0activities (e.g., transactional data or rating values). One of user u and user u . Based on the similarity calculation, setthe commonly used heuristic techniques is a neighborhood- NðuÞ of nearest neighbors of user u is obtained. The size ofbased approach that finds nearest neighbors that have set N(u) can range anywhere from 1 to jUj À 1, i.e., all other Ãtastes similar to those of the target user , , , , users in the data set. Then, R ðu; iÞ is calculated as the 0. In contrast, model-based techniques use previous user adjusted weighted sum of all known ratings Rðu ; iÞ, where 0activities to first learn a predictive model, typically using u 2 NðuÞ , :some statistical or machine-learning methods, which is then P 0 0 0used to make recommendations. Examples of such techni- Ã u0 2NðuÞ simðu; u ÞÁðRðu ; iÞ À Rðu ÞÞ R ðu; iÞ ¼ RðuÞ þ P 0 : ð2Þques include Bayesian clustering, aspect model, flexible u0 2NðuÞ jsimðu; u Þjmixture model, matrix factorization, and other methods ,, , , , . Here, RðuÞ represents the average rating of user u. In real-world settings, recommender systems generally A neighborhood-based CF technique can be user-basedperform the following two tasks in order to provide or item-based, depending on whether the similarity isrecommendations to each user. First, the ratings of unrated calculated between users or items. Formulas (1) and (2)items are estimated based on the available information represent the user-based approach, but they can be(typically using known user ratings and possibly also straightforwardly rewritten for the item-based approachinformation about item content or user demographics) because of the symmetry between users and items in allusing some recommendation algorithm. And second, the neighborhood-based CF calculations . In our experi-system finds items that maximize the user’s utility based on ments, we used both user-based and item-based approachesthe predicted ratings, and recommends them to the user. for rating estimation.Ranking approaches proposed in this paper are designed toimprove the recommendation diversity in the second task of 2.1.2 Matrix Factorization CF Techniquefinding the best items for each user. Matrix factorization techniques have been the mainstay of Because of the decomposition of rating estimation and numerical linear algebra dating back to the 1970s , ,recommendation ranking tasks, our proposed ranking  and have recently gained popularity in recommender
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 899systems applications because of their effectiveness in (threshold for “high” ratings, denoted by TH ) as “highlyimproving recommendation accuracy , , , . ranked” and the ratings less than 3.5 as “nonhighlyMany variations of matrix factorization techniques have ranked.” Furthermore, in real-world settings, recommenderbeen developed to solve the problems of data sparsity, systems typically recommend the most highly rankedoverfitting, and convergence speed, and they turned out to N items since users are usually interested in only severalbe a crucial component of many well-performing algorithms most relevant recommendations, and this list of N items forin the popular Netflix Prize1 competition , , , , user u can be defined as LN ðuÞ ¼ fi1 ; . . . ; iN g, where, , . We implemented the basic version of this RÃ ðu; ik Þ ! TH for all k 2 f1; 2; ::; Ng. Therefore, in ourtechnique, as presented in . With the assumption that a paper, we evaluate the recommendation accuracy based on the percentage of truly “highly ranked” ratings, denoteduser’s rating for an item is composed of a sum of preferences by correctðLN ðuÞÞ, among those that were predicted to beabout the various features of that item, this model is induced the N most relevant “highly ranked” items for each user,by Singular Value Decomposition (SVD) on the user-item i.e., using the popular precision-in-top-N metric . Theratings matrix. In particular, using K features (i.e., rank-K metric can be written formally asSVD), user u is associated with a user-factor vector pu (the X Xuser’s preferences for K features), and item i is associatedwith an item-factor vector qi (the item’s importance weights precision-in-top-N ¼ jcorrectðLN ðuÞÞj jLN ðuÞj; u2U u2Ufor K features). The preference of how much user u likesitem i, denoted by RÃ ðu; iÞ, is predicted by taking an inner where correctðLN ðuÞÞ ¼ fi 2 LN ðuÞjRðu; iÞ ! TH g. How-product of the two vectors, i.e., ever, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a RÃ ðu; iÞ ¼ pT qi : u ð3Þ user. It has often been suggested that recommender systems must be not only accurate, but also useful , . For All values in user- and item-factor vectors are initially example, McNee et al.  suggest new user-centricassigned to arbitrary numbers and estimated with a simple directions for evaluating recommender systems beyondgradient descent technique as described in (4). User- and the conventional accuracy metrics. They claim that seren-item-factor vectors are iteratively updated with learning dipity in recommendations or user experiences andrate parameter () as well as regularization parameter (), expectations also should be considered in evaluating thewhich is used to minimize overfitting, until the minimum recommendation quality. Among many different aspectsimprovement in predictive accuracy or a predefined that cannot be measured by accuracy metrics alone, in thisnumber of iterations per feature is reached. One learning paper, we focus on the notion of the diversity of recommen-iteration is defined as http://ieeexploreprojects.blogspot.com dations, which is discussed next. For each rating Rðu; iÞ 2.3 Diversity of Recommendations err ¼ Rðu; iÞ À pT qi u As mentioned in Section 1, the diversity of recommendations pu ¼ pu þ ðerr Â qi À Â pu Þ (4) can be measured in two ways: individual and aggregate. qi ¼ qi þ ðerr Â pu À Â qi Þ Most of recent studies have focused on increasing the End For individual diversity, which can be calculated from each user’s Finally, unknown ratings are estimated with the final two recommendation list (e.g., an average dissimilarity betweenvectors pu and qi as stated in (3). More details on variations all pairs of items recommended to a given user) , ,of matrix factorization techniques used in recommender , , . These techniques aim to avoid providing toosystems can be found in , , , , . similar recommendations for the same user. For example, some studies , ,  used an intralist similarity metric2.2 Accuracy of Recommendations to determine the individual diversity. Alternatively, ZhangNumerous recommendation techniques have been devel- and Hurley  used a new evaluation metric, item novelty,oped over the last few years, and various metrics have been to measure the amount of additional diversity that one itememployed for measuring the accuracy of recommendations, brings to a list of recommendations. Moreover, the loss ofincluding statistical accuracy metrics and decision-support accuracy, resulting from the increase in diversity, ismeasures . As examples of statistical accuracy metrics, controlled by changing the granularity of the underlyingmean absolute error (MAE) and root mean squared error similarity metrics in the diversity-conscious algorithms .(RMSE) metrics measure how well a system can predict an On the other hand, except for some work that examinedexact rating value for a specific item. Examples of decision- sales diversity across all users of the system by measuring asupport metrics include precision (the percentage of truly statistical dispersion of sales , , there have been“high” ratings among those that were predicted to be few studies that explore aggregate diversity in recommender“high” by the recommender system), recall (the percentage systems, despite the potential importance of diverseof correctly predicted “high” ratings among all the ratings recommendations from both user and business perspec-known to be “high”), and F-measure, which is a harmonic tives, as discussed in Section 1. Several metrics can be usedmean of precision and recall. In particular, the ratings of the to measure aggregate diversity, including the percentage ofdata sets that we used in our experiments are integers items that the recommender system is able to makebetween 1 and 5, inclusive, where higher value represents a recommendations for (often known as coverage) . Sincebetter liked item. As commonly done in recommender we intend to measure the recommender systems perfor-systems literature, we define the items greater than 3.5 mance based on the top-N recommended items lists that the system provides to its users, in this paper, we use the total 1. More information can be found at www.netflixprize.com. number of distinct items recommended across all users as
900 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012an aggregate diversity measure, which we will refer to asdiversity-in-top-N and formally define as follows:
: u2U Note that the diversity-in-top-N metric can also serve asan indicator of the level of personalization provided by arecommender system. For example, a very low diversity-in-top-N indicates that all users are being recommendedthe same top-N items (low level of personalization),whereas a very high diversity-in-top-N points to the factthat every user receives her own unique top-N items (highlevel of personalization). In summary, the goal of the proposed ranking ap-proaches is to improve the diversity of recommendations;however, as described in Section 1, there is a potential Fig. 1. Performance of the standard ranking approach and item-tradeoff between recommendation accuracy and diversity. popularity-based approach with its parameterized versions.Thus, in this paper, we aim to find techniques that canimprove aggregate diversity of recommendations while criteria are needed in order to achieve diversity improve-maintaining adequate accuracy. ment. Since recommending best selling items to each user typically leads to diversity reduction, recommending less3 MOTIVATIONS FOR RECOMMENDATION popular items intuitively should have an effect toward RERANKING increasing recommendation diversity. And, as seen from the example in Table 1 (in Section 1), this intuition hasIn this section, we discuss how reranking of the candidate empirical support. Following this motivation, we exploreitems whose predictions are above TH can affect the the possibility to use item popularity as a recommendationaccuracy-diversity tradeoff and how various item ranking ranking criterion, and in the next section, we show how thisfactors, such as popularity-based approach, can improve approach can affect the recommendation quality in terms ofthe diversity of recommendations. Note that the general accuracy and diversity.idea of personalized information ordering is not new; e.g., http://ieeexploreprojects.blogspot.comits importance has been discussed in information retrieval 3.2 Proposed Approach: Item-Popularity-Basedliterature , , including some attempts to reduce Rankingredundancy and promote the diversity of retrieved results Item-popularity-based ranking approach ranks items di-by reranking them , , . rectly based on their popularity, from lowest to highest, where popularity is represented by the number of known3.1 Standard Ranking Approach ratings that each item has. More formally, item-popularity-Typical recommender systems predict unknown ratings based ranking function can be written as follows:based on known ratings, using any traditional recommen-dation technique such as neighborhood-based or matrix rankItemPop ðiÞ ¼ jUðiÞj; where UðiÞ ¼ fu 2 U j 9Rðu; iÞg:factorization CF techniques, discussed in Section 2.1. Then, We compared the performance of the item-popularity-the predicted ratings are used to support the user’s decision based ranking approach with the standard ranking ap-making. In particular, each user u gets recommended a list proach using MovieLens data set and item-based CF, and weof top-N items, LN ðuÞ, selected according to some ranking present this comparison using the accuracy-diversity plot incriterion. More formally, item ix is ranked ahead of item iy Fig. 1. In particular, the results show that, as compared to the(i.e., ix 0 iy ) if rankðix Þ rankðiy Þ, where rank: I ! IR is a standard ranking approach, the item-popularity-basedfunction representing the ranking criterion. The vast ranking approach increased recommendation diversity frommajority of current recommender systems use the predicted 385 to 1,395 (i.e., 3.6 times!); however, recommendationrating value as the ranking criterion: accuracy dropped from 89 to 69 percent. Here, despite the Ã À1 rankStandard ðiÞ ¼ R ðu; iÞ : significant diversity gain, such a significant accuracy loss (20 percent) would not be acceptable in most real-life The power of À1 in the above expression indicates that personalization applications. Therefore, next we introducethe items with highest predicted (as opposed to lowest a general technique to parameterize recommendation rank-predicted) ratings RÃ ðu; iÞ are the ones being recommended ing approaches, which allows to achieve significant diversityto user. In the paper, we refer to this as the standard ranking gains while controlling accuracy losses (e.g., according toapproach, and it shares the motivation with the widely used how much loss is tolerable in a given application).probability ranking principle in information retrievalliterature that ranks the documents in order of decreasing 3.3 Controlling Accuracy-Diversity Tradeoff:probability of relevance . Parameterized Ranking Approaches Note that, by definition, recommending the most highly The item-popularity-based ranking approach as well as allpredicted items selected by the standard ranking approach other ranking approaches proposed in this paper (to beis designed to help improve recommendation accuracy, but discussed in Section 4) are parameterized with “rankingnot recommendation diversity. Therefore, new ranking threshold” TR 2 ½TH ; Tmax (where Tmax is the largest possible
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 901Fig. 2. General overview of ranking-based approaches for improving recommendation diversity.rating on the rating scale, e.g., Tmax ¼ 5) to allow user the ranking approach. An even higher threshold 4.7 still makesability to choose a certain level of recommendation accuracy. it possible to achieve 20 percent diversity gain (from 385 toIn particular, given any ranking function rankX ðiÞ, ranking 462) with only 0.06 percent of accuracy loss.threshold TR is used for creating the parameterized version Also note that, even when there are less than N itemsof this ranking function, rankX ði; TR Þ, which is formally above the ranking threshold TR , by definition, all the itemsdefined as above TR are recommended to a user, and the remaining 8 http://ieeexploreprojects.blogspot.com according to the standard ranking Ã top-N items are selected rankx ðiÞ; if R ðu; iÞ approach. This ensures that all the ranking approaches 2 ½TR ; Tmax ; proposed in this paper provide the same exact number of rankx ði; TR Þ ¼ u þ rankStandard ðiÞ; if RÃ ðu; iÞ recommendations as their corresponding baseline techni- : 2 ½TH ; TR Þ; ques (the ones using the standard ranking approach), which is very important from the experimental analysis point of where Iu ðTR Þ ¼ fi 2 IjRÃ ðu; iÞ ! TR g; u ¼ max rankx ðiÞ: Ã Ã i2Iu ðTR Þ view as well in order to have a fair performance comparison of different ranking techniques. Simply put, items that are predicted above rankingthreshold TR are ranked according to rankX ðiÞ, while items 3.4 General Steps for Recommendation Rerankingthat are below TR are ranked according to the standard The item-popularity-based ranking approach describedranking approach rankStandard ðiÞ. In addition, all items that above is just one example of possible ranking approachesare above TR get ranked ahead of all items that are below TR for improving recommendation diversity, and a number of(as ensured by u in the above formal definition). Thus, additional ranking functions, rankX ðiÞ, will be introducedincreasing the ranking threshold TR 2 ½TH ; Tmax toward in Section 4. Here, based on the previous discussion inTmax would enable choosing the most highly predicted items Section 3, we summarize the general ideas behind theresulting in more accuracy and less diversity (becoming proposed ranking approaches, as illustrated by Fig. 2.increasingly similar to the standard ranking approach); in The first step, shown in Fig. 2a, represents the standardcontrast, decreasing the ranking threshold TR 2 ½TH ; Tmax approach, which, for each user, ranks all the predicted itemstoward TH would make rankX ði; TR Þ increasingly more according to the predicted rating value and selects top-Nsimilar to the pure ranking function rankX ðiÞ, resulting in candidate items, as long as they are above the highlymore diversity with some accuracy loss. predicted rating threshold TH . The recommendation quality Therefore, choosing different TR values in between the of the overall recommendation technique is measured inextremes allows the user to set the desired balance between terms of the precision-in-top-N and the diversity-in-top-N,accuracy and diversity. In particular, as Fig. 1 shows, the as shown in the accuracy-diversity plot at the right side ofrecommendation accuracy of item-popularity-based rank- the example (a).ing approach could be improved by increasing the ranking The second step, illustrated in Fig. 2b, shows thethreshold. For example, the item-popularity-based ranking recommendations provided by applying one of the proposedapproach with ranking threshold 4.4 could minimize the ranking functions, rankX ðiÞ, where several different itemsaccuracy loss to 1.32 percent, but still could obtain 83 percent (that are not necessarily among N most highly predicted, butdiversity gain (from 385 to 703), compared to the standard are still above TH ) are recommended to the user. This way, a
902 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012Fig. 3. Relationships between various item ranking criteria and predicted rating value, for highly predicted ratings (MovieLens data).user can get recommended more idiosyncratic, long-tail, less used as alternatives to rankStandard to improve recommenda-frequently recommended items that may not be as widely tion diversity, and the formal definitions of each rankingpopular, but can still be very relevant to this user (as approach (provided below) are illustrated in Fig. 3 withindicated by relatively high predicted rating). Therefore, empirical evidence. Because of the space limitations, in this http://ieeexploreprojects.blogspot.comreranking the candidate items can significantly improve the section we show the empirical results from MovieLens datarecommendation diversity although, as discussed, this set; however, consistently similar patterns were found intypically comes at some loss of recommendation accuracy. other data sets (discussed in Section 5.1) as well.The performance graph of the second step (b) demonstrates In particular, in our empirical analysis, we consistentlythis accuracy-diversity tradeoff. observed that popular items, on average, are likely to have The third step, shown in Fig. 2c, can significantly higher predicted ratings than less popular items, using bothminimize accuracy loss by confining the reranked recom- heuristic- and model-based techniques for rating prediction,mendations to the items above newly introduced ranking as shown in Fig. 3a. As discussed in Section 3, recommend-threshold TR (e.g., 3.8 out of 5). In this particular illustration, ing less popular items helps to improve recommendationnote that the increased ranking threshold makes the fifth diversity; therefore, as can be immediately suggested fromrecommended item in step (b) (i.e., item with predicted the monotonic relationship between average item popular-rating value of 3.65) filtered out and the next possible item ity and predicted rating value, recommending not as highlyabove the new ranking threshold (i.e., the item predicted as predicted items (but still predicted to be above TH ) likely3.81) is recommended to user u. Averaged across all users, implies recommending, on average, less popular items,this parameterization helps to make the level of accuracy potentially leading to diversity improvements. Therefore,loss fairly small with still a significant diversity gain (as we propose to use predicted rating value itself as an itemcompared to the standard ranking approach), as shown in ranking criterion:the performance graph of step (c). We now introduce several additional item ranking . Reverse Predicted Rating Value, i.e., ranking thefunctions, and provide empirical evidence that supports candidate (highly predicted) items based on theirour motivation of using these item criteria for diversity predicted rating value, from lowest to highest (as aimprovement. result choosing less popular items, according to Fig. 3a). More formally:4 ADDITIONAL RANKING APPROACHES rankRevPred ðiÞ ¼ RÃ ðu; iÞ:In many personalization applications (e.g., movie or book We now propose several other ranking criteria thatrecommendations), there often exist more highly predicted exhibit consistent relationships to predicted rating value,ratings for a given user than can be put in her top-N list. including average rating, absolute likeability, relative like-This provides opportunities to have a number of alternative ability, item rating variance, and neighbors’ rating variance,ranking approaches, where different sets of items can as shown in Figs. 3b, 3c, 3d, 3e, and 3f. In particular, thepossibly be recommended to the user. In this section, we relationship between predicted rating values and theintroduce six additional ranking approaches that can be average actual rating of each item (as explicitly rated by
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 903users), shown in Fig. 3b, also supports a similar conjecture TABLE 2that items with lower average rating, on average, are more Basic Information of Movie Rating Data Setslikely to have lower predicted rating values (likelyrepresenting less popular items, as shown earlier). Thus,such items could be recommended for better diversity. . Item Average Rating, i.e., ranking items according to an average of all known ratings for each item: 1 X rankAvgRating ðiÞ ¼ RðiÞ; where RðiÞ ¼ Rðu; iÞ: jUðiÞj u2UðiÞ Similarly, the relationship between predicted ratingvalues and item absolute (or relative) likeability, shown inFigs. 3c and 3d, also suggests that the items with lowerlikeability, on average, are more likely to have lowerpredicted rating values (likely representing less popular . Neighbors’ Rating Variance, i.e., ranking itemsmovies) and, thus, could be recommended for better according to the rating variance of neighbors of adiversity. particular user for a particular item. The closest neighbors of user u among the users who rated the . Item Absolute Likeability, i.e., ranking items accord- particular item i, denoted by u0 , are chosen from the ing to how many users liked them (rated the item set of UðiÞ NðuÞ. above TH ): rankNeighborVar ðiÞ rankAbsLike ðiÞ ¼ jUH ðiÞj; 1 X where UH ðiÞ ¼ fu 2 UðiÞjRðu; iÞ ! TH g: ¼ ðRðu0 ; iÞ À Ru ðiÞÞ2 ; jUðiÞ NðuÞj u0 2ðUðiÞNðuÞÞ . Item Relative Likeability, i.e., ranking items accord- 1 X ing to the percentage of the users who liked an item where Ru ðiÞ ¼ Rðu0 ; iÞ: (among all users who rated it): jUðiÞ NðuÞj u0 2ðUðiÞNðuÞÞ http://ieeexploreprojects.blogspot.com rankRelLike ðiÞ ¼ jUH ðiÞj=jUðiÞj: In summary, there exist a number of different ranking We can also use two different types of rating variances to approaches that can improve recommendation diversity by recommending items other than the ones with topmostimprove recommendation diversity. With any traditional predicted rating values to a user. In addition, as indicatedrecommendation technique, each item’s rating variance in Fig. 1, the degree of improvement (and, more impor-(which can be computed from known ratings submitted for tantly, the degree of tolerable accuracy loss) can bethat item) can be used for reranking candidate items. Also, controlled by the chosen ranking threshold value TR . Theif any neighborhood-based recommendation technique is next section presents comprehensive empirical resultsused for prediction, we can use the rating variance of demonstrating the effectiveness and robustness of theneighbors whose ratings are used to predict the rating for proposed ranking techniques.reranking candidate items. As shown in Figs. 3e and 3f, therelationship between the predicted rating value and each 5 EMPIRICAL RESULTSitem’s rating variance and the relationship between pre-dicted rating value and 50 neighbors’ rating variance 5.1 Dataobtained by using a neighborhood-based CF technique The proposed recommendation ranking approaches weredemonstrate that highly predicted items tend to be low in tested with several movie rating data sets, includingboth item rating variance and neighbors’ rating variance. In MovieLens (data file available at grouplens.org), Netflixother words, among the highly predicted ratings (i.e., above (data file available at netflixprize.com), and Yahoo! MoviesTH ), there is more user consensus for higher predicted items (individual ratings collected from movie pages at movies.than for lower predicted ones. These findings indicate that yahoo.com). We preprocessed each data set to include usersreranking recommendation list by rating variance and and movies with significant rating history, which makes itchoosing the items with higher variance could improve possible to have sufficient number of highly predicted itemsrecommendation diversity. for recommendations to each user (in the test data). The basic statistical information of the resulting data sets is . Item Rating Variance, i.e., ranking items according summarized in Table 2. For each data set, we randomly to each item’s rating variance (rating variance of chose 60 percent of the ratings as training data and used users who rated the item): them to predict the remaining 40 percent (i.e., test data). 1 X 5.2 Performance of Proposed Ranking Approaches rankItemVar ðiÞ ¼ ðRðu; iÞ À RðiÞÞ2 : jUðiÞj u2UðiÞ We conducted experiments on the three data sets described in Section 5.1, using three widely popular recommendation
904 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012techniques for rating prediction, including two heuristic- prediction techniques, and different data sets, as shown inbased (user-based and item-based CF) and one model- Tables 3a, 3b, and 3c.based (matrix factorization CF) techniques, discussed in In general, all proposed ranking approaches were able toSection 2.1. All seven proposed ranking approaches were provide significant diversity gains, and the best performingused in conjunction with each of the three rating prediction ranking approach may be different depending on thetechniques to generate top-N (N ¼ 1, 5, 10) recommenda- chosen data set and rating prediction technique. Thus,tions to each user on each data set, with the exception of system designers have the flexibility to choose the mostneighbors’ variance-based ranking of model-based pre- desirable ranking approach based on the data in a givendicted ratings. In particular, because there is no concept of application. We would also like to point out that, since theneighbors in a pure matrix factorization technique, the proposed approaches essentially are implemented as sort-ranking approach based on neighbors’ rating variance was ing algorithms based on certain ranking heuristics, they areapplied only with heuristic-based techniques. We set extremely scalable. For example, it took, on average, lesspredicted rating threshold as TH ¼ 3:5 (out of 5) to ensure than 6 seconds to rank all the predicted items and selectthat only relevant items are recommended to users, and top-N recommendations for nearly 3,000 users in ourranking threshold TR was varied from 3.5 to 4.9. The experiments with MovieLens data.performance of each ranking approach was measured in 5.3 Robustness Analysis for Different Parametersterms of precision-in-top-N and diversity-in-top-N (N ¼ 1, In this section, we present robustness analysis of the5, 10), and, for comparison purposes, its diversity gain and proposed techniques with respect to several parameters:precision loss with respect to the standard ranking number of neighbors used in heuristic-based CF, number ofapproach was calculated. features used in matrix factorization CF, number of top-N Consistently with the accuracy-diversity tradeoff dis- recommendations provided to each user, the value ofcussed in Section 1, all the proposed ranking approaches predicted rating threshold T , and the level of data sparsity. Himproved the diversity of recommendations by sacrificing We tested the heuristic-based technique with a differentthe accuracy of recommendations. However, with each number of neighbors (15, 20, 30, and 50 neighbors) and theranking approach, as ranking threshold TR increases, the model-based technique with a different number of featuresaccuracy loss is significantly minimized (smaller precision (K ¼ 8, 16, 32, and 64). For illustration purposes, Figs. 4aloss) while still exhibiting substantial diversity improve- and 4b show how two different ranking approaches for bothment. Therefore, with different ranking thresholds, one can heuristic-based and model-based rating prediction techni-obtain different diversity gains for different levels of ques are affected by different parameter values. Whiletolerable precision loss, as compared to the standard different parameter values may result in slightly different http://ieeexploreprojects.blogspot.comranking approach. Following this idea, in our experiments, performance (as is well known in recommender systemswe compare the effectiveness (i.e., diversity gain) of literature), the fundamental behavior of the proposeddifferent recommendation ranking techniques for a variety techniques remains robust and consistent, as shown inof different precision loss levels (0.1-10 percent). Figs. 4a and 4b. In other words, using the recommendation While, as mentioned earlier, a comprehensive set of ranking techniques with any of the parameter values, it isexperiments was performed using every rating prediction possible to obtain substantial diversity improvements withtechnique in conjunction with every recommendation only a small accuracy loss.ranking function on every data set for different number of We also vary the number of top-N recommendationstop-N recommendations; the results were very consistent provided by the system. Note that, while it is intuitivelyacross all experiments and, therefore, for illustration clear that top-1, top-5, and top-10 recommendations willpurposes and because of the space limitations, we show provide different accuracy and diversity levels (i.e., it isonly three results: each using all possible ranking techniques much easier to accurately recommend one relevant itemon a different data set, a different recommendation than relevant 10 items, and it is much easier to have moretechnique, and a different number of recommendations. aggregate diversity when you can provide more recommen-(See Table 3.) dations), again we observe that, with any number of top-N For example, Table 3a shows the performance of the recommendations, the proposed techniques exhibit robustproposed ranking approaches used in conjunction with and consistent behavior, i.e., they allow to obtain substantialitem-based CF technique to provide top-5 recommendations diversity gains at a small accuracy loss, as shown in Fig. 4c.on the MovieLens data set. In particular, one can observe For example, with only 1 percent precision loss, we were able to increase the diversity from 133 to 311 (134 percentthat, with the precision loss of only 0.001 or 0.1 percent (i.e., gain) using the reverse predicted rating value-based rankingwith precision of 0.891, down from 0.892 of the standard approach in the top-1 recommendation task, and from 385 toranking approach), item average rating-based ranking 655 (70 percent gain) using the item-popularity-basedapproach can already increase recommendation diversity ranking approach in the top-5 recommendation task.by 20 percent (i.e., absolute diversity gain of 78 on top of the In addition, our finding that the proposed ranking385 achieved by the standard ranking approach). If users approaches help to improve recommendation diversity iscan tolerate precision loss up to 1 percent (i.e., precision of also robust with respect to the “highly predicted” rating0.882 or 88.2 percent), the diversity could be increased by threshold value TH . In particular, with a different thresh-81 percent with the same ranking technique; and 5 percent old, the baseline recommendation accuracy and diversityprecision loss (i.e., 84.2 percent) can provide diversity gains of the standard ranking approach could be very different,up to 189 percent for this recommendation technique on this and the number of actual recommendations that aredata set. Substantial diversity improvements can be ob- produced by the system (i.e., in case there is a limitedserved across different ranking techniques, different rating number of items that are predicted higher than the
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 905 TABLE 3 Diversity Gains of Proposed Ranking Approaches for Different Levels of Precision Loss http://ieeexploreprojects.blogspot.comNote: Precision Loss ¼ [Precision-in-top-N of proposed ranking approach] À [Precision-in-top-N of standard ranking approach]. Diversity Gain(column 1) ¼ [Diversity-in-top-N of proposed ranking approach] À [Diversity-in-top-N of standard ranking approach]. Diversity Gain (column 2) ¼[Diversity-in-top-N of proposed ranking approach] / [Diversity-in-top-N of standard ranking approach].minimum threshold) may change. However, again we than nonpersonalized reranking approaches (such as basedobserve the same consistent ability of the proposed on item popularity).ranking approaches to achieve substantial diversity gains Finally, the data sets we used for our experiments (seewith only a small accuracy loss. For example, as shown in Table 2) were obtained using a specific sampling (prepro-Table 4, with a different predicted rating threshold (i.e., cessing) strategy—by choosing items and users with largestTH ¼ 4:5) and 1 percent precision loss, we could obtain number of ratings (i.e., strategy of top users and top items,68 percent diversity gain by ranking the recommendations described as Data 3 in Table 5), which resulted in relativelybased on item average rating in top-1 recommendationtask on MovieLens data set using item-based CF for rating dense rating data sets. Thus, for robustness analysis, weprediction. Similar improvements were observed for other generated sparser data sets (Data 1 and 2 in Table 5) from thedata sets and rating prediction techniques as well. Also original MovieLens data set by applying different samplingnote that there is an implicit natural assumption of strategies that have been used in prior literature . Table 5recommender systems selectivity that is associated with summarizes the basic characteristics of these resulting datasome reranking approaches, i.e., the assumption that sets, including the strategies for choosing items and users.recommender systems will use some reasonably high Fig. 5 illustrates the impact of data sparsity on thevalue of TH which substantially narrows the set of possible recommendation results using one of the proposed rerank-recommendations to only the relevant items for each user. ing approaches as an example (i.e., average rating). MoreIf recommender systems are not selective (i.e., if a huge importantly, as shown in Fig. 5, the behavior of the proposednumber of items are considered relevant to each user),then personalized reranking approaches (such as based on reranking techniques remains consistent with different datareverse predicted rating value) would retain better ability sampling approaches, i.e., it is possible to obtain diversityto provide more aggregate diversity in recommendations improvements with only a small accuracy loss.
906 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012Fig. 4. Performance of the proposed ranking approaches with different parameters. TABLE 4 Performance of Proposed Ranking Approaches with a Different Predicted Rating Threshold (TH ¼ 4:5) http://ieeexploreprojects.blogspot.com6 DISCUSSION AND ADDITIONAL ANALYSIS compare some of the proposed ranking approaches with this random ranking approach in Fig. 6. For example, asIn this section, we explore and discuss several additional shown in Fig. 6a, the random ranking approach increasedissues related to the proposed ranking approaches. the diversity from 385 to 596 (55 percent gain) with6.1 Random Ranking Approach 1 percent precision loss using heuristic-based CF techniqueAs mentioned earlier, the vast majority of traditional on MovieLens data set. While this gain was not as big asrecommender systems adopt the standard ranking ap- the diversity gain of the average rating-based approachproach that ranks the candidate items according to theirpredicted rating values and, thus, recommends to users thetopmost highly predicted items. As discussed in Section 3, TABLE 5since the more highly predicted items, on average, tend to MovieLens Data Sets with Different Sampling Strategiesbe among the more popular items, using this rankingapproach will often result in lower recommendationdiversity. While the proposed ranking approaches improvethe diversity by considering alternative item rankingfunctions, such as item popularity, we also found thatreranking the candidate items even at random can providediversity improvements as compared to the standardranking approach. Here, we defined the random ranking as RankRandom ðiÞ ¼ Randomð0; 1Þ;where Random(0,1) is a function that generates uniformlydistributed random numbers in the [0, 1] interval. We
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 907Fig. 5. Diversity gains with sparse data sets.(80 percent gain), it actually outperformed the neighbors’rating variance-based approach (35 percent gain). Asanother example, as shown in Fig. 6b, with only 0.5 percentprecision loss on Netflix data set using model-based CFtechnique, the random ranking approach produced theresults that were almost as good (27 percent diversity gain)as several best performing ranking approaches (i.e.,30 percent gain for the reverse predicted rating-basedapproach or 33 percent gain for the relative likeability-based approach). This provides a valuable insight that, if the goal is toimprove recommendation diversity (without significant Fig. 6. Diversity gain of the random ranking approach with differentaccuracy loss), even a random recommendation ranking levels of precision loss.approach can significantly outperform the traditional and http://ieeexploreprojects.blogspot.comwidely used standard ranking approach (based on the bestsellers and not long-tail items . However, as shownpredicted rating value). Furthermore, as illustrated in in Fig. 7, the proposed ranking approaches are able toFigs. 6a and 6b, the random ranking approach works recommend significantly more long-tail items with a smallconsistently well with different data sets and in conjunc- level of accuracy loss, and this distribution becomes evention with different CF techniques. more skewed toward long-tail items if more accuracy loss6.2 Impact of Proposed Ranking Approaches on the can be tolerated. For example, with 1 percent precision loss, Distribution of Recommended Items the percentage of recommended long-tail items increasedSince we measure recommendation diversity as the total from 16 to 21 percent with neighbors’ rating variance-basednumber of distinct items that are being recommended ranking approach, or to 32 percent with item popularity andacross all users, one could possibly argue that, while the item absolute likeability-based approaches. And with 2.5 ordiversity can be easily improved by recommending a few 5 percent precision loss, the proportion of long-tail itemsnew items to some users, it may not be clear whether the can grow up to 43 and 58 percent, respectively (e.g., usingproposed ranking approaches would be able to shift the item-popularity ranking technique).overall distribution of recommended items toward more This analysis provides further empirical support to theidiosyncratic, “long-tail” recommendations. Therefore, in fact that the proposed ranking approaches increase notthis section, we explore how the proposed ranking just the number of distinct items recommended, but alsoapproaches change the actual distribution of recommended the proportion of recommended long-tail items, thus,items in terms of their popularity. Following the popular confirming that the proposed techniques truly contribute“80-20 rule” or the Pareto principle, we define the top- toward more diverse and idiosyncratic recommendations20 percent of the most frequently rated items in the training across all users.data set as “bestsellers” and the remaining 80 percent of Furthermore, in addition to the distributional analysisitems as “long-tail” items. We calculated the percentage of based on the simple proportion of the long-tail items, welong-tail items among the items recommended across all have also used three more sophisticated metrics: entropyusers by the proposed ranking approaches as well as by thestandard ranking approach. The results are shown in Fig. 7. , Gini coefficient , and Herfindahl index . All of For example, with the standard ranking approach, the these measures provide different ways of measuringlong-tail items consist of only 16 percent of total recom- distributional dispersion of recommended items across allmendations (i.e., 84 percent of recommendations were of users, by showing the degree to which recommendationsbestsellers) when recommending top-5 items to each user are concentrated on a few popular items (i.e., low diversity)using item-based CF technique on MovieLens data set, or are more equally spread out across all candidate itemsconfirming some findings in prior literature that recom- (i.e., high diversity). In particular, the entropy-basedmender systems often gravitate toward recommending diversity metric Entropy-Diversity is calculated as
908 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 X n nþ1Ài recðiÞ Gini-Diversity ¼ 2 Â ; i¼1 nþ1 total X recðiÞ2 n Herfindahl-Diversity ¼ 1 À : i¼1 total Importantly, Fig. 8 (the top three graphs) demonstrates that all three distributional inequality metrics are very highly correlated with our diversity-in-top-N metric for our various ranking-based approaches. This means that our proposed ranking techniques do not just manipulate our simple diversity-in-top-N metric to increase the number of different items among the recommendations, but also fundamentally change the distribution of recommended items toward more evenly distributed representation. Fig. 8 (the bottom three graphs) also shows that the rankingFig. 7. Proportion of long-tail items among recommended items. Note: approaches exhibit similar patterns of diversity gains (or,Percentage of Long-Tail Items = Percentage of recommended items that more generally, of the accuracy-diversity tradeoff) usingare not among top-20 percent most popular items. these more sophisticated distributional inequality metrics. This provides an additional confirmation that the proposed X recðiÞ recðiÞ n reranking techniques truly contribute toward more diverse Entropy-Diversity ¼ À ln ; i¼1 total total and idiosyncratic recommendations across all users.where rec(i) is the number of users who got recommended 6.3 Global Optimization-Based Approachitem i, n is the total number of candidate items that were While the focus of this paper is on ranking-based techniques for improving recommendation diversity, weavailable for recommendation, and total is the total number would also like to discuss the feasibility of the more direct,of top-N recommendations made across all users (i.e., global optimization approach to deal with this problem.total ¼ NjUj). We also use the original Gini coefficient, Intuitively, one can conceptualize our problem as a multi-which is a commonly used measure of wealth distribution criteria optimization problem, where the system should provide recommendations that are as diverse and asinequality , to calculate the Gini-Diversity metric, and http://ieeexploreprojects.blogspot.com Because of the inherent tradeoff accurate as possible.the original Herfindahl index , also known as the between these two metrics, we can use a common approachconcentration index, to calculate the Herfindahl-Diversity to solving multicriteria optimization problems that opti-metric. However, we reverse the scale for the two metrics mizes only one of the criteria and converts the others to constraints. In particular, given some diversity metric d andfor more intuitiveness (i.e., so that smaller values represent the target diversity level D, we can search the space of alllower diversity and larger values higher diversity). As a possible top-N recommendation configurations for all usersresult, these metrics are calculated as until an optimal configuration is found. Here, everyFig. 8. Using distributional diversity metrics: correlation with diversity-in-top-N and performance of the proposed approaches.
ADOMAVICIUS AND KWON: IMPROVING AGGREGATE RECOMMENDATION DIVERSITY USING RANKING-BASED TECHNIQUES 909recommendation configuration LN consists of jUj sets of compared to the standard ranking approach that rankstop-N items (one set LN ðuÞ for each user u), and the optimal recommended items based on their predicted rating value.recommendation configuration maximizes the sum of Therefore, another interesting topic for future researchpredicted rating values (as a proxy for high predictive would be to explore possibilities to improve both accuracyaccuracy) while having diversity value at least D: and diversity. X X Based on the findings described in this paper, a possible max RÃ ðu; iÞ; subject to dðLN Þ ! D: approach to improving both the accuracy and diversity of LN u2U i2LN ðuÞ the standard technique would be to modify the proposed Because of the sheer size of the solution space and recommendation reranking techniques, which are alreadypotential complexity of some diversity metrics d (e.g., such known to produce diversity gains, in a way that increasesas the distributional diversity metrics discussed in Sec- their accuracy. Perhaps counterintuitively, one of thetion 6.2), this problem formulation based on the global possible ways to do this involves recommending feweroptimization is not practical for many real-world applica- items. In particular, the parameterized versions of thetions, e.g., Netflix, which has 100,000s of users, each of proposed ranking techniques use threshold TR to differ-which can possibly be recommended 10,000s of movies. entiate the items that should be ranked by the proposedWhile the global optimization-based approach can promise technique from the ones to be ranked by the standardthe highest predictive accuracy for the given level of ranking technique, as discussed in Section 3.3. However, TRdiversity, this approach would be prohibitive even forapplications that are much smaller than Netflix’s. can be used not only for ranking, but also for filtering Note that, for the simpler diversity measures, such as purposes, i.e., by updating the parameterized rankingdiversity-in-top-N which simply counts the number of function as follows:different items that appear in the recommendations, it may rankx ðiÞ; if RÃ ðu; iÞ 2 ½TR ; Tmax ;be feasible to design efficient optimization-based heuristics. rankx ði; TR Þ ¼ Remove item; if RÃ ðu; iÞ 2 ½TH ; TR Þ:For example, one such heuristic for the diversity-in-top-Nmetric could be: find the most frequently recommended This will recommend only items that are predicted toitem imax ; replace one recommendation of imax for some user be not only above TH , but above TR as well (wherewho has another item inew among the candidate items that always TR ! TH ), consequently improving the recommen-has not appeared in any recommendations; and repeat until dation accuracy.there are no such replacements possible. Our exploration of While the comprehensive exploration of this phenomen-several such simple heuristics has resulted in the following on is beyond the scope of this paper, in Fig. 9, we illustratetwo insights. First, our proposed ranking approaches based how the item-popularity-based ranking approach can be http://ieeexploreprojects.blogspot.comonly on local information (i.e., where recommendation modified using the above-mentioned strict filtering policydecision can be made within each user, by performing a to improve upon the standard approach both in terms ofsimple sort on a single numeric value) are significantly accuracy and diversity. As Fig. 9 demonstrates, item-faster than even such simple global heuristics that have tokeep track which items have been recommended across all popularity-based ranking approach with TR ¼ 4:1 (out ofusers and how many times. And second, while these global 5) generates only 56.6 percent of all possible itemheuristics perform well for the simple diversity-in-top-N recommendations that could be obtained from standardmetric, they are “overspecialized” for it, i.e., they do not ranking approach (because the recommendations withfundamentally change the distribution of recommended predicted rating 4:1 were removed). Interestingly, how-movies toward more even representation. In particular, for ever, despite the smaller number of recommendations, thisa given level of accuracy loss, while the optimization-based ranking approach increased the recommendation accuracyheuristics were able to achieve better diversity-in-top-N as by 4.6 percent (from 83.5 to 88.1 percent) and diversity bycompared to our proposed ranking techniques, they 70 items or 7.8 percent (from 881 to 951). As shown in Fig. 9,performed worse according to the distributional diversity using different TR values allows to produce differentmetrics (entropy, Gini, and Herfindahl), which further accuracy and diversity gains.emphasizes the robustness of the proposed techniques that As discussed above, this approach would not be able todemonstrated substantial improvements according to both provide all N recommendations for each user, but it never-types of diversity measures. theless may be useful in the cases where system designers Because the focus of this paper is on ranking-based need the flexibility to apply other recommendation strategiestechniques, the comprehensive exploration of optimization- to fill out the remaining top-N item slots. For example, somebased approaches is beyond the scope of this paper. recommender systems may want to adopt “exploration-However, the development of fast and robust optimiza- versus-exploitation” strategy , where some of the recom-tion-based approaches that are capable of improving mendations are tailored directly toward the user’s tastes andrecommendation diversity while maintaining an acceptable preferences (i.e., exploitation), and the proposed rankinglevel of accuracy constitutes an interesting issue for future techniques with strict filtering can be used to fill out this partresearch. of the recommendation list for each user (providing both6.4 Improving both Accuracy and Diversity: accuracy and diversity benefits over the standard approach). Recommending Fewer Items Meanwhile, the remaining recommendations can be de-Empirical results in this paper consistently show that the signed to learn more about the user (i.e., exploration), e.g.,proposed ranking approaches can obtain significant diver- using active learning techniques , , so that the systemsity gains (with a small amount of accuracy loss) as can make better recommendations to the users in the future.
910 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 important step in this line of research. Furthermore, exploration of recommendation diversity when recom- mending item bundles  or sequences  (instead of individual items) also constitutes interesting topics for future research. In summary, we hope that this paper will stimulate further research on improving recommendation diversity and other aspects of recommendation quality. ACKNOWLEDGMENTS The research reported in this paper was supported in part by the US National Science Foundation (NSF) grant IIS-0546443. REFERENCESFig. 9. Improving both accuracy and diversity of recommendations (in  G. Adomavicius and A. Tuzhilin, “Toward the Next Generation ofparentheses: percentage of possible recommendations generated). Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 6, pp. 734-749, June 2005.7 CONCLUSIONS AND FUTURE WORK  C. Anderson, The Long Tail. Hyperion, 2006.  M. Balabanovic and Y. Shoham, “Fab: Content-Based, Collabora-Recommender systems have made significant progress in tive Recommendation,” Comm. ACM, vol. 40, no. 3, pp. 66-72,recent years and many techniques have been proposed to 1997.improve the recommendation quality. However, in most  R. Bell, Y. Koren, and C. Volinsky, “The BellKor Solution to thecases, new techniques are designed to improve the accuracy Netflix Prize,” www.netflixprize.com/assets/ProgressPrize2007_ KorBell.pdf, 2007.of recommendations, whereas the recommendation diver-  R.M. Bell, Y. Koren, and C. Volinsky, “The Bellkor 2008 Solution tosity has often been overlooked. In particular, we showed the Netflix Prize,” http://www.research.att.com/~volinsky/that, while ranking recommendations according to the netflix/ProgressPrize2008BellKorSolution.pdf, 2008.predicted rating values (which is a de facto ranking standard  J. Bennett and S. Lanning, “The Netflix Prize,” Proc. KDD-Cup and Workshop at the 13th ACM SIGKDD Int’l Conf. Knowledge and Datain recommender systems) provides good predictive accu- Mining, 2007.racy, it tends to perform poorly with respect to recommen-  D. Billsus and M. Pazzani, “Learning Collaborative Information http://ieeexploreprojects.blogspot.com Machine Learning, 1998.dation diversity. Therefore, in this paper, we proposed a Filters,” Proc. Int’l Conf.number of recommendation ranking techniques that can  K. Bradley and B. Smyth, “Improving Recommendation Diver- sity,” Proc. 12th Irish Conf. Artificial Intelligence and Cognitiveprovide significant improvements in recommendation Science, 2001.diversity with only a small amount of accuracy loss. In  S. Breese, D. Heckerman, and C. Kadie, “Empirical Analysis ofaddition, these ranking techniques offer flexibility to system Predictive Algorithms for Collaborative Filtering,” Proc. 14th Conf.designers, since they are parameterizable and can be used Uncertainty in Artificial Intelligence, 1998.  E. Brynjolfsson, Y.J. Hu, and D. Simester, “Goodbye Paretoin conjunction with different rating prediction algorithms Principle, Hello Long Tail: The Effect of Search Costs on the(i.e., they do not require the designer to use only some Concentration of Product Sales,” Management Science, vol. 57,specific algorithm). They are also based on scalable sorting- no. 8, pp. 1373-1386, 2011.based heuristics and, thus, are extremely efficient. We  E. Brynjolfsson, Y. Hu, and M.D. Smith, “Consumer Surplus in the Digital Economy: Estimating the Value of Increased Productprovide a comprehensive empirical evaluation of the Variety at Online Booksellers,” Management Science, vol. 49, no. 11,proposed techniques and obtain consistent and robust pp. 1580-1596, 2003.diversity improvements across multiple real-world data  J. Carbonell and J. Goldstein, “The User of MMR, Diversity-Basedsets and using different rating prediction techniques. Reranking for Reordering Documents and Producing Summa- ries,” Proc. ACM Conf. Research and Development in Information This work gives rise to several interesting directions for Retrieval (SIGIR), pp. 335-336, 1998.future research. In particular, additional important item  J. Delgado and N. Ishii, “Memory-Based Weighted-Majorityranking criteria should be explored for potential diversity Prediction for Recommender Systems,” Proc. ACM SIGIR Work-improvements. This may include consumer-oriented or shop Recommender Systems: Algorithms and Evaluation, 1999.  D. Fleder and K. Hosanagar, “Blockbuster Culture’s Next Rise ormanufacturer-oriented ranking mechanisms , depend- Fall: The Impact of Recommender Systems on Sales Diversity,”ing on the given application domain, as well as external Management Science, vol. 55, no. 5, pp. 697-712, 2009.factors, such as social networks . Also, as mentioned  S. Funk, “Netflix Update: Try This at Home” http://sifter.org/ ~simon/journal/20061211.html, 2006.earlier, optimization-based approaches could be used to  K.R. Gabriel and S. Zamir, “Lower Rank Approximation ofachieve further improvements in recommendation diver- Matrices by Least Squares with Any Choice of Weights,”sity, although these improvements may come with a Technometrics, vol. 21, pp. 489-498, 1979.(possibly significant) increase in computational complexity.  R. Garfinkel, R. Gopal, A. Tripathi, and F. Yin, “Design of a Shopbot and Recommender System for Bundle Purchases,”Moreover, because of the inherent tradeoff between the Decision Support Systems, vol. 42, no. 3, pp. 1974-1986, 2006.accuracy and diversity metrics, an interesting research  A. Ghose and P. Ipeirotis, “Designing Novel Review Rankingdirection would be to develop a new measure that captures Systems: Predicting Usefulness and Impact of Reviews,” Proc.both of these aspects in a single metric. In addition, user Ninth Int’l Conf. Electronic Commerce (ICEC), 2007. Gini, “Measurement of Inequality and Incomes,” The Economicstudies exploring users’ perceptions and acceptance of the  C. vol. 31, pp 124-126, 1921. J.,diversity metrics as well as the users’ satisfaction with  D.G. Goldstein and D.C. Goldstein, “Profiting from the Longdiversity-sensitive recommender systems would be an Tail,” Harvard Business Rev., vol. 84, no. 6, pp. 24-28, June 2006.