A survey of memory based methods for collaborative filtering based techniques


International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March – April (2013), pp. 366-372, © IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

A SURVEY OF MEMORY BASED METHODS FOR COLLABORATIVE FILTERING BASED TECHNIQUES FOR ONLINE RECOMMENDER SYSTEMS

Anuj Verma (1), Kishore Bhamidipati (2)

(1) Dept. of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Manipal, Karnataka - 576104, India
(2) Asst. Professor - Sr. Scale, Dept. of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Manipal, Karnataka - 576104, India

ABSTRACT

Cyberspace aims to provide an increasingly dynamic experience to users. The rise of electronic commerce has led to efforts to provide a highly efficient, high-quality experience to the consumer. Recommender Systems are a step in this direction. They help make sense of the vast amount of available data and, in particular, help in understanding each user. One of the most successful techniques for generating recommendations is Collaborative Filtering. The technique uses available information about existing users to generate predictions for the active user. A widely employed approach for this purpose is the memory based algorithm. The existing preferences of users are represented in the form of a user-item matrix. The method uses the complete or partial user-item matrix to isolate the nearest users to the active user and then generate the prediction. The majority of early work on electronic commerce and recommender systems concentrates only on technical aspects such as algorithm design and the computational needs of such systems. Little attention has been given to questions about the need for such systems or how effective they are at what they try to do. Along with looking at the various stages of a memory based collaborative filtering system, we propose an experiment to check the effectiveness of the predictions or ratings generated by such systems.

Keywords: Collaborative Filtering, Collective Intelligence, Memory Based Algorithm, Recommendation System
1. INTRODUCTION

Recommender Systems are an important tool for market analysis as well as for a deeper understanding of customer behavior. This is especially relevant in the current time of information explosion, when a customer is confounded by the variety of options available on every issue. Recommender Systems try to bridge the gap between the user and the market by mathematically determining what a user may prefer.

The distinguishing feature of collaborative filtering is its generality. The technique is unchanged for any type of data, i.e., the working of the system remains the same irrespective of the nature of the information, so the system structure is the same for any application, from a book recommendation system to a movie recommendation system.

Collaborative filtering can be applied through two different approaches: Memory-based and Model-based. Memory-based collaborative filtering systems use the complete user-item rating matrix, or a part of it, to generate recommendations. The Model-based approach attempts to determine a pattern or trend in the given ratings data and then constructs a model to generate recommendations [1].

The Memory-based approach has been discussed at length and is predominantly used in commercial systems for several reasons. The first reason is its ease of use. Since it concentrates on the user-item database, it is easier to apply and to account for. The second reason is its intuitive nature. As the system keeps collecting data about a particular user, it immediately takes this new information into account when generating recommendations. Hence the predictions are always up to date. The third reason is cost. Memory-based systems are less costly and hence outperform the other approach in speed and resource usage [2].

The first limitation of this approach is that it is rating dependent. The behavioral trends or tastes of a user may change over time. The user may also grow reluctant during the rating process and may rate items selectively or incorrectly. Another factor is the limited scope of ratings. Data belonging to a particular domain can be used to generate predictions for that specific domain only. It is difficult to predict the breakfast preferences of a user by analyzing the music that user listens to. The second limitation is data sparsity. When a new user is introduced to the system, it takes time to build a profile for that user, as no information about the user exists yet. This is called the cold start problem [2].

We examine the various stages of a general collaborative filtering algorithm and then try to analyze the effectiveness of the various techniques employed in them.

2. COLLABORATIVE FILTERING ALGORITHM

A Recommender System can be imagined as a black box used as a filter of information. The input is the data gathered about a user (the active user). One of the most important algorithms inside the filter is the similarity computation algorithm, which determines the proximity of different users and represents this 'nearness' as a numerical weight. This weight can be any measure that can be used to determine 'nearness' between two entities. Euclidean distance and the Correlation Coefficient are widely used measures. Angular Distance can also be used. The output of the filter is the generated recommendation. The three main stages of the process are: Representation of Input Data, Similarity Computation and Recommendation Generation [3].
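To make the idea of a numerical 'nearness' weight concrete, the following minimal Python sketch computes the three kinds of measure named above for two users. It is an illustration under stated assumptions rather than the implementation of any surveyed system: the dict layout (item to rating), the function names and the sample ratings are all hypothetical; the Euclidean distance is mapped to a similarity with 1/(1 + d), which is one common choice; the Pearson means are taken over the co-rated items only; and the cosine denominator uses every item each user has rated.

```python
from math import sqrt


def co_rated(a, b):
    """Return the items rated by both users (a and b map item -> rating)."""
    return sorted(set(a) & set(b))


def euclidean_similarity(a, b):
    """Euclidean distance over co-rated items, mapped into (0, 1] (assumed mapping)."""
    items = co_rated(a, b)
    if not items:
        return 0.0
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in items))
    return 1.0 / (1.0 + distance)


def pearson(a, b):
    """Pearson correlation over co-rated items; means taken over those items (assumption)."""
    items = co_rated(a, b)
    n = len(items)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in items) / n
    mean_b = sum(b[i] for i in items) / n
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in items)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in items)) * sqrt(
        sum((b[i] - mean_b) ** 2 for i in items)
    )
    return num / den if den else 0.0


def cosine(a, b):
    """Cosine of the angle between rating vectors; unrated items count as absent."""
    num = sum(a[i] * b[i] for i in co_rated(a, b))
    den = sqrt(sum(r * r for r in a.values())) * sqrt(sum(r * r for r in b.values()))
    return num / den if den else 0.0


# Hypothetical sample ratings on a 1-5 scale.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3, "m4": 5}

for name, measure in [("euclidean", euclidean_similarity),
                      ("pearson", pearson),
                      ("cosine", cosine)]:
    print(name, round(measure(alice, bob), 3))
```

In this toy data bob rates every co-rated item exactly one point lower than alice, so the correlation is perfect while the Euclidean-based weight is well below 1. This illustrates why the choice of measure can change which users end up in a neighborhood.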
2.1 Representation of Input Data

Data about a specific user can be gathered by explicit and implicit methods. The explicit method involves collecting ratings or asking about the likes and dislikes of a consumer (Thumbs Up/Thumbs Down buttons). The implicit method involves checking the browsing history, tracking the number of clicks and recording the time spent on a particular page. The gathered data is represented in the form of the user-item matrix [4].

2.2 Similarity Computation

Similarity Computation is the most important step of the Recommendation System because the accuracy of these calculations determines the accuracy of the system. This step identifies the k nearest users to the active user. These k users form the neighborhood of the active user. The rating is generated from the neighborhood of the active user [3]. The different methods to calculate the similarity are:

2.2.1 Euclidean Distance

This method takes all the items that two users have rated in common and treats them as the axes of a graph. The users are then represented as points on the graph, and the distance between the points is measured using the Euclidean distance formula. The distance between users A and B can be represented as follows:

\[ w(A,B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} \tag{1} \]

where $A_i$ and $B_i$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items. Despite its simplicity, a disadvantage of this method is the two-dimensional nature of the measure. The raw distance is non-negative and unbounded; the range [0,1] holds once the distance is converted to a similarity score, for example 1/(1 + w(A,B)) [5].

2.2.2 Pearson Correlation

The Pearson Correlation Coefficient is a widely used statistical measure of how strongly two entities are related. It determines the degree of association between two variables. The nearer the points are to a straight line, the higher their strength of association. The Pearson Correlation between users A and B can be represented as follows [6]:

\[ w(A,B) = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2}\,\sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}} \tag{2} \]

where $A_i$ and $B_i$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items. $\bar{A}$ and $\bar{B}$ are the average ratings for user A and user B respectively. Unlike the Euclidean measure, it has the wider range [-1,1] and can therefore take negative values. Its strength is that it is unaffected by differences in rating scale between users and so corrects for non-normalized data.

2.2.3 Cosine Similarity

Vector based cosine similarity is an important technique used for string matching and for checking the similarity of two documents. It can be applied to collaborative filtering as well, if the users are considered as the documents to be matched, the items as words and the ratings as the frequency of occurrence of the words. With this measure we are trying to establish the angle between the two vectors [6]. The cosine similarity between two users A and B can be represented as follows:

\[ w(A,B) = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{j=1}^{n_1} A_j^2}\,\sqrt{\sum_{j=1}^{n_2} B_j^2}} \tag{3} \]

where $A_i$ and $B_i$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items. $n_1$ and $n_2$ denote the number of items rated by user A and user B respectively. The measure has the range [-1,1]. An angle of 0 gives a cosine of 1, which indicates that the vectors overlap and hence that the users have similar tastes. This measure is particularly useful when the data is sparse or the co-rated items are few and a useful relationship cannot be determined using other measures.

2.3 Prediction Generation

Once the similarity computation is completed and a suitable neighborhood is formed, the prediction is generated. The task can be accomplished using various methods, the most trivial of which is taking a simple average or mean of the obtained ratings. A more effective method is to take the weighted average of the available ratings. The rating for a particular item k for user A can be represented as follows [6]:

\[ p_{A,k} = \bar{A} + \kappa \sum_{q=1}^{n} w(A,q)\,(q_k - \bar{q}) \tag{4} \]

where $\bar{A}$ is the average rating for items rated by user A, $w(A,q)$ is the similarity between user A and neighbor q, and n is the number of neighbors in the neighborhood. $q_k$ is user q's rating for item k and $\bar{q}$ is the average rating for all items rated by q. $\kappa$ is a normalizing factor, commonly chosen so that the absolute values of the weights sum to one.

The aim is to calculate the expected rating for all items that have not yet been rated by the active user and then recommend the N highest rated items in the neighborhood. This is called the 'Top N' recommendation approach [7].

3. EXPERIMENTAL DESIGN

The success of a collaborative filtering based algorithm depends on the effectiveness of the similarity computation method used. Hence our task is the evaluation of three different techniques commonly used for memory based collaborative filtering: Euclidean Distance, Pearson Correlation Coefficient and Vector Based Cosine Similarity. An empirical study was conducted to measure the effectiveness of the ratings generated using the various similarity measures. For our study, we considered explicit input, i.e., numerical ratings given by the user, to examine the system. The dataset used was a widely used dataset for Movie Recommendation Systems: the MovieLens 100,000 movie ratings dataset (MovieLens is a free service provided by GroupLens Research at the University of Minnesota).

A standard one-page questionnaire was prepared containing a list of 100 common movies belonging to the dataset, and users were asked to provide ratings for them. The ratings were collected from 52 users, locally as well as through online social networking media. Basic demographic information about the users was also recorded. For an input form to be valid, a user must have rated at least 15% of the items, i.e., 15 movies out of 100.

Participant details are as follows:
• Total Participants = 52
• Age Range = 14-48 years
• Gender Ratio (Male:Female) = 2:1
• Valid Ratings = 1188

To empirically evaluate the techniques, we use repeated random sub-sampling validation. About 10% of the collected ratings act as validation data, i.e., 120 of the 1188 ratings, so that we can generate the corresponding ratings from the system and then compare the deviation. The validation subset is generated at random. The rest of the collected ratings form one part of the training subset; the other part consists of the ratings already available in the MovieLens dataset for the same items. To arrive at a firm conclusion, this procedure is repeated thrice for each similarity measure. Mean-centered ratings are used. We are interested in the relative performance of the three measures. The neighborhood size is fixed at 30. Two users should have at least 5 co-rated items for their similarity to be considered.

• Total Valid Collected Ratings = 1188
• Collected Ratings to be Tested = $R_{Test}$ = 10% of the Valid Collected Ratings = 120
• Collected Ratings used to generate predictions by the CF System = $R_{Train1}$ = 1188 - 120 = 1068
• Already available Ratings for the corresponding items from the MovieLens Dataset = $R_{Train2}$ = 19228
• Total ratings used by the CF System = $R_{Train} = R_{Train1} + R_{Train2}$ = 1068 + 19228 = 20296
• Therefore, 20296 ratings will be used to generate predictions for 120 ratings.
• Neighborhood Size = 30 Nearest Neighbors
• The process is to be carried out thrice for 3 techniques, hence Total No. of Passes = 3 × 3 = 9

4. RESULTS

To judge the accuracy of the similarity computation technique we consider the following parameters:

• Average deviation for the generated ratings. Deviation is the difference between the actual rating and the predicted rating.
• Average deviation is measured by calculating the MAE (Mean Absolute Error), given by:

\[ MAE(f) = \frac{1}{|R_{Test}|} \times \sum_{r_i \in R_{Test}} |r_i - f(i)| \]

  where $r_i$ is a rating in $R_{Test}$ and $f(i)$ is the corresponding rating generated using $R_{Train}$.
• No. of satisfactory or good predictions, i.e., no. of generated predictions for which the deviation was < 0.5
• No. of unsatisfactory or bad predictions, i.e., no. of generated predictions for which the deviation was > 1

The data obtained after carrying out the experimental design are as follows:
Table 1: Obtained Results

Pass  Similarity Type  R_Test  R_Train  R_Gen  Total Deviation  MAE    Good Ratings  Bad Ratings
1     PCC              120     20296    107    98.05            0.916  38            39
2     PCC              120     20296    108    102.89           0.953  36            45
3     PCC              120     20296    110    95.43            0.868  39            33
4     ED               120     20296    88     72.55            0.824  28            33
5     ED               120     20296    96     87.35            0.91   32            37
6     ED               120     20296    104    87.64            0.843  32            39
7     CS               120     20296    120    85.49            0.712  49            32
8     CS               120     20296    120    95.49            0.796  41            36
9     CS               120     20296    120    85.64            0.714  45            29

(Note: PCC - Pearson Correlation Coefficient; ED - Euclidean Distance; CS - Cosine Similarity)

An important observation is that the number of ratings generated satisfies $R_{Gen} \le R_{Test}$. This is because when similarity is computed for an active user, only 30 neighbours are considered. The items rated by these 30 users do not always cover all 100 items for which ratings have been recorded. Hence, in some cases, the rating for a particular item for a specific user is left un-generated. Therefore, in calculating MAE, we use $R_{Gen}$ instead of $R_{Test}$.

The three passes for each similarity type are then used to measure the mean performance:

Table 2: Mean Performance

SN  Similarity Type      Avg. MAE  Avg. % Good Ratings  Avg. % Bad Ratings
1   Euclidean Distance   0.859     31.97                37.85
2   Pearson Correlation  0.912     34.77                36.04
3   Cosine Similarity    0.741     37.5                 26.94

As can be observed, for a slight increase in average MAE, the Pearson Correlation Coefficient produces a higher percentage of good predictions and a lower percentage of bad predictions than the Euclidean distance measure. It can thus be considered the superior measure of the two. Vector Based Cosine Similarity outperforms the other two on all 3 parameters. Hence it is the most effective of the three.

The results suggest that the Memory Based Collaborative Filtering algorithm performs the recommendation task with reasonable accuracy and precision. It can thus be considered a significant approach for Collaborative Filtering based recommendation.
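The evaluation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the dict-based rating layout and the toy data are hypothetical. The prediction follows the weighted form of Eq. (4), with the normalizing factor assumed to be one over the sum of the absolute weights, and a prediction is left un-generated (None) when no neighbour has rated the item. The MAE and the good and bad counts are then taken over the generated predictions only.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def predict(active, item, neighbours, weights):
    """Weighted-deviation prediction in the style of Eq. (4).

    active     : dict item -> rating for the active user
    neighbours : dict user -> (dict item -> rating)
    weights    : dict user -> similarity weight w(A, q)
    Returns None when no neighbour with a non-zero weight rated the item.
    """
    raters = [q for q, ratings in neighbours.items()
              if item in ratings and weights[q] != 0]
    norm = sum(abs(weights[q]) for q in raters)  # assumed normalizing factor
    if norm == 0:
        return None
    deviation = sum(
        weights[q] * (neighbours[q][item] - mean(neighbours[q].values()))
        for q in raters
    )
    return mean(active.values()) + deviation / norm


def evaluate(actual, predicted):
    """MAE and good/bad counts over generated predictions only (R_Gen)."""
    errors = [abs(r - p) for r, p in zip(actual, predicted) if p is not None]
    return {
        "generated": len(errors),
        "mae": sum(errors) / len(errors) if errors else None,
        "good": sum(1 for e in errors if e < 0.5),  # deviation < 0.5
        "bad": sum(1 for e in errors if e > 1),     # deviation > 1
    }


# Hypothetical toy data: one active user, two neighbours.
active = {"m1": 4, "m2": 2}
neighbours = {
    "q1": {"m1": 5, "m2": 3, "m3": 4},
    "q2": {"m1": 2, "m3": 1},
}
weights = {"q1": 1.0, "q2": 0.5}

p_m3 = predict(active, "m3", neighbours, weights)   # both neighbours rated m3
p_m9 = predict(active, "m9", neighbours, weights)   # nobody rated m9 -> None
report = evaluate([3, 5, 4], [p_m3, p_m9, 2.5])
print(p_m3, p_m9, report)
```

The same arithmetic underlies Table 1: for pass 1, the total deviation divided by the number of generated ratings is 98.05 / 107 ≈ 0.916, which matches the reported MAE.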
5. CONCLUSION

The aim of recommender systems is to automate the generation of precise predictions. We set out to check the effectiveness of a widely used approach for this purpose, the Memory Based Algorithm. The most crucial stage in the algorithm is neighborhood formation by similarity calculation. So, we checked the effectiveness of commonly used similarity measures. The quantitative results of our experiments indicated that Vector Based Cosine Similarity was a more effective similarity measure than the Pearson Correlation Coefficient and Euclidean distance based similarity. The memory based algorithm produces practicable predictions and is thus an effective technique for online recommendation. Possible extensions include carrying out a similar study after normalizing the ratings (z-score normalization can be used for this) and varying the similarity weight according to the number of co-rated items (significance weighting).

REFERENCES

[1] B.M. Sarwar, G. Karypis, J.A. Konstan and J. Riedl, Item based collaborative filtering recommendation algorithms, Proc. 10th International Conference on World Wide Web (WWW '01), 2001, 285–295.
[2] X. Su and T.M. Khoshgoftaar, A Survey of Collaborative Filtering Techniques, Advances in Artificial Intelligence, Hindawi Publishing Corporation, Article ID 421425, 2009, 19 pages.
[3] E. Vozalis and K.G. Margaritis, Analysis of Recommender Systems' Algorithms, Proc. 6th Hellenic-European Conference on Computer Mathematics and its Applications (HERCMA), 2003/9.
[4] D. Militaru and C. Zaharia, A survey of collaborative filtering-based systems for online recommendation, Proc. 12th International Conference on Electronic Commerce: Roadmap for the Future of Electronic Business (ICEC '10), ACM, New York, 2010, 43-47.
[5] G. Adomavicius and A. Tuzhilin, Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, 17(6), 2005, 734-749.
[6] T. Segaran, Making Recommendations, in Programming Collective Intelligence, (USA: O'Reilly Media, 2007) 7-28.
[7] J. Breese, D. Heckerman and C. Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, Microsoft Research, Redmond, Technical Report MSR-TR-98-12, 1998, 43-52.
[8] C.R. Cyril Anthoni and Dr. A. Christy, "Integration of Feature Sets with Machine Learning Techniques for Spam Filtering", International Journal of Computer Engineering & Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47 - 52, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[9] Suresh Kumar RG, S. Saravanan and Soumik Mukherjee, "Recommendations for Implementing Cloud Computing Management Platforms using Open Source", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, 2012, pp. 83 - 93, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.