Improving Collaborative Filtering Based Recommenders Using Topic Modelling

Published in: Data & Analytics

  1. Improving Collaborative Filtering Based Recommenders using Topic Modelling
     Jobin Wilson (1), Santanu Chaudhury (2) & Brejesh Lall (2)
     (1) R&D Department, Flytxt, Trivandrum, India
     (2) Dept. of Electrical Engineering, IIT Delhi, India
     Confidential. © Flytxt. All rights reserved. 18 August 2014
  2. Agenda
       • Recommender Systems Overview
       • Proposed Approach
       • Experiments & Results
       • Conclusion
  3. Recommender Systems Overview
       • An information filtering technique that locates products, services or information that are relevant and exciting to users, based on their historical preferences and utilizing the “wisdom of the crowds”
       • E.g. editorial, aggregates (top views, top downloads, recent), personalized recommendations
       • Formally: U = set of users; I = set of items; utility function F relates U to I through a rating R (e.g. 0-5 stars, a real number)
       • Task: for each user, estimate her preference for items she has not yet seen, given all the existing user ratings
  4. Design Matrix
       • A sparse matrix representing each user’s preference for each item
       • Algorithms need to predict values for the empty cells based on the available cell values
       • The denser the matrix, the better the quality of recommendations

       User \ Item | i1  | i2  | i3  | i4  | i5
       u1          |     | r12 |     | r14 | r15
       u2          | r21 | r22 |     |     | r25
       u3          |     | r32 |     | r34 |
       u4          |     |     | r43 |     | r45
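A minimal sketch of the design matrix above, stored as a dictionary of dictionaries so that only the filled cells take up space. The numeric rating values are invented for illustration (the slide only shows symbolic entries such as r12), and the helper names `density` and `empty_cells` are hypothetical.

```python
# Filled cells mirror the slide's layout; the rating values themselves are made up.
ratings = {
    "u1": {"i2": 4.0, "i4": 3.0, "i5": 5.0},
    "u2": {"i1": 2.0, "i2": 5.0, "i5": 1.0},
    "u3": {"i2": 3.0, "i4": 4.0},
    "u4": {"i3": 5.0, "i5": 2.0},
}
items = ["i1", "i2", "i3", "i4", "i5"]


def density(ratings, items):
    """Fraction of user-item cells that hold a rating."""
    filled = sum(len(user_ratings) for user_ratings in ratings.values())
    return filled / (len(ratings) * len(items))


def empty_cells(ratings, items):
    """(user, item) pairs a recommender would have to predict."""
    return [(u, i) for u in sorted(ratings) for i in items if i not in ratings[u]]


print(density(ratings, items))
print(empty_cells(ratings, items)[:3])
```

Real rating matrices are far sparser than this toy example, which is why the sparse (dictionary) layout is the usual choice.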
  5. Taxonomy of Recommender Algorithms
       • Collaborative Filtering (CF) Based
           • Neighborhood Based (e.g. User Based, Item Based)
           • Latent Factor Based (e.g. SVD; factorization of the user-item rating matrix to determine latent properties of users and items)
       • Content Based
           • Constructing a user profile from history and matching content profiles to the learned user profile
       • Hybrid
           • Combining multiple approaches to improve the quality of recommendations
  6. Proposed Approach
       Intuition
       • In many domains, considerable contextual data in text form is available, describing the items being recommended (e.g. movies, e-commerce)
       • Standard CF algorithms do not consider latent properties of users/items which may influence a user’s rating decisions
       • Discovering such latent properties helps address the sparsity problem, because similarity can be calculated even when a pair of users has no overlapping ratings
       Approach Overview
       • Discover user profiles in a latent topic space by leveraging contextual data in text form and each user’s historical ratings
       • Build a hybrid neighborhood based on a similarity score that considers latent topic space similarity as well as rating overlap based similarity
       • Recommend items not yet picked by the user, based on their popularity within the hybrid neighborhood
  7. Process Flow (diagram)
       • Rating data → rating based nbhd.
       • Contextual data collection & pre-processing → extract item text descriptions → generate item txt files (e.g. plot + genre for movies) → LDA on item txt files → item-topic and user-topic distributions → topic similarity based nbhd.
       • Both neighborhoods → generate hybrid user neighborhood → Hybrid User Neighborhood Based Recommender → top-K un-rated items in nbhd. → collect IR-stats
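The last stage of the flow, picking the top-K un-rated items in a user's neighborhood, might look like the sketch below. It assumes that "popularity" means the number of neighbors who rated an item, which the slides do not spell out; the function name `recommend` and the toy data are hypothetical.

```python
from collections import Counter


def recommend(user, ratings, neighbors, k=5):
    """Rank items the user has not rated by how many neighbors rated them."""
    seen = set(ratings.get(user, {}))
    counts = Counter()
    for neighbor in neighbors:
        for item in ratings.get(neighbor, {}):
            if item not in seen:
                counts[item] += 1
    # Sort by count (descending), then by item id so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


# Toy data, invented for illustration.
toy_ratings = {
    "u1": {"a": 5},
    "u2": {"a": 4, "b": 5, "c": 3},
    "u3": {"b": 4, "d": 2},
    "u4": {"b": 1, "c": 5},
}
print(recommend("u1", toy_ratings, ["u2", "u3", "u4"], k=2))
```

Weighting each neighbor's vote by rating value or by similarity would be an equally plausible reading of "popularity".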
  8. Topic Modelling and LDA
       • LDA is a generative probabilistic model for analyzing large discrete datasets such as text corpora
       • Each document is represented as a random mixture of latent topics
       • Each topic is represented as a distribution over words
       • Documents and the words within them are observable; the model has to infer the document-topic distributions and the topic-word distributions
       • E.g. “Yesterday's cricket match was good, we played well.” => more of “Sports”, less of “Negotiation”
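For reference, the generative process that LDA assumes can be written as below. This is the standard smoothed formulation from the LDA literature; the symbols (K topics, D documents, N_d words in document d, hyperparameters α and β) are notation chosen here and do not appear on the slide.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(topic-word distribution)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(document-topic distribution)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic of the $n$-th word)} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  &&&& \text{(observed word)}
\end{align*}
```

Only the words w are observed; inference recovers θ (used here as item-topic distributions) and φ (the topic-keyword lists).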
  9. Discovering User Profile in Latent Topic Space
       • Load into memory the matrix I formed by the item-topic distribution vectors of all item documents
       • For each user U, look up the items that she has expressed interest in and load them into a list L
       • Initialize U’s topic distribution vector to zeros
       • For each item i in L, add the topic distribution vector of i, multiplied by U’s rating of i normalized by the sum of all ratings from U, to U’s topic distribution vector
       • Persist U’s topic distribution vector as her user profile
       Summary: for each user, add up the item-topic distributions of the items she is interested in, each weighted by her normalized rating, to obtain her topic distribution vector, which is her user profile in the latent topic space
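The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function name `build_user_profile`, the dictionary-based inputs and the toy values are hypothetical, and items without a topic vector are simply skipped (the slides do not say how such items are handled).

```python
def build_user_profile(user_ratings, item_topics, num_topics):
    """Rating-weighted sum of item-topic vectors for one user.

    user_ratings: {item_id: rating}
    item_topics:  {item_id: [p(topic 0), p(topic 1), ...]}
    """
    # Assumption: only items that have a topic vector contribute to the profile.
    rated = {i: r for i, r in user_ratings.items() if i in item_topics}
    total = sum(rated.values())
    profile = [0.0] * num_topics
    if total == 0:
        return profile
    for item, rating in rated.items():
        weight = rating / total  # rating normalized by the sum of the user's ratings
        for k, p in enumerate(item_topics[item]):
            profile[k] += weight * p
    return profile


# Toy example with two topics; values invented for illustration.
item_topics = {"m1": [0.8, 0.2], "m2": [0.1, 0.9]}
profile = build_user_profile({"m1": 4, "m2": 1}, item_topics, num_topics=2)
print(profile)
```

Because the weights sum to one and each item vector is a probability distribution, the resulting profile is itself a distribution over topics.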
  10. User Similarity Functions
       • Latent topic space similarity (equation image not included in this transcript)
       • Rating overlap based similarity (equation image not included in this transcript); standard Pearson-correlation similarity or cosine similarity could be used as well
       • Hybrid similarity function (equation image not included in this transcript)
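The similarity formulas themselves did not survive in this transcript, so the sketch below is only one plausible shape for them and should not be read as the authors' definitions. It assumes cosine similarity between topic-space profiles, Jaccard overlap of rated item sets as a stand-in for the rating overlap measure, and a linear blend controlled by a hypothetical weight `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_overlap(items_a, items_b):
    """Stand-in rating overlap similarity: shared rated items over all rated items."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_similarity(profile_a, profile_b, items_a, items_b, alpha=0.5):
    """Assumed linear blend; alpha is a hypothetical tuning weight."""
    topic_sim = cosine(profile_a, profile_b)
    overlap_sim = jaccard_overlap(items_a, items_b)
    return alpha * topic_sim + (1 - alpha) * overlap_sim


# Two users with no rated items in common still get a non-zero score.
print(hybrid_similarity([0.7, 0.3], [0.7, 0.3], {"a"}, {"b"}))
```

The usage line illustrates the point made on the Proposed Approach slide: topic-space similarity remains available when the rating overlap is empty.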
  11. Datasets for Our Experiments
       • Movielens 1M dataset (6040 users and 3706 movies)
       • Subset of the Netflix dataset with 2M ratings (5000 users and 17770 movies)
       • Metadata from IMDB interfaces & OMDBAPI
       • Source files: Movies.dat (Movielens), ratings.dat (Movielens), Plot.list (IMDB)
  12. Topic - Keyword Distribution Sample - Movielens-IMDB
       T0: black men man white
       T1: apartment women boyfriend
       T2: time kill train buddy deal
       T3: story drama history real stories
       T4: day face fate actions prove led
       T5: father son mother child
       T6: secret agent fbi government thriller
       T7: british american indian africa
       T8: children animation adventures musical
       T9: job girlfriend worse
       T10: evil king princess magic prince fantasy
       T11: money fortune hard million
       T12: police crime drug mafia
       T13: documentary music band stage
       T14: horror death mysterious ghost
       T15: bond james british agent cia
       T16: comedy kids vacation
       T17: murder police thriller detective
       T18: brother make academy lassard
       T19: drama lawyer court attorney
       T20: world group action battle
       T21: town local people
       T22: priest god church angel
       T23: york city manhattan phone
       T24: school friends girl college
       T25: house home hill mansion
       T26: wayne bruce batman gotham
       T27: show comedy television network
       T28: car truck steal run accident
       T29: lives drama relationship childhood
       T30: love romance marry girl
       T31: prison escape jail
       T32: ship crew island sea
       T33: hospital suicide doctor psychiatrist
       T34: plane airport flight rescue
       T35: wife husband affair sexual
       T36: friend private insurance
       T37: coach player basketball winning
       T38: drama death accident lonely
       T39: comedy great farm
       T40: company business career working
       T41: find brothers gang members
       T42: family daughter drama home
       T43: professor scientist research doctor
       T44: good home mind change
       T45: war army vietnam nazi
       T46: london paris england french
       T47: friendship relationship
       T48: night friends party stay
       T49: fi sci planet alien
  13. Sample Item-Topic Distributions Discovered (charts): Tomorrow Never Dies; Schindler's List
  14. Sample User Profiles Discovered (charts): Movie User 5989; Movie User 5988
  15. Results: Movielens 1M
       • Standard Item Based CF performs the worst, with precision values well below 1%
       • Standard User Based CF yields precision values below 5%
       • The proposed HUNR (Hybrid User Neighborhood Based Recommender) performs best, with precision at 5 above 31%
       • Recall at 30 shows that HUNR retrieves > 25% of the relevant items, whereas standard User Based CF retrieves < 5% of them
       • F-measure analysis also confirms that HUNR significantly outperforms standard CF techniques
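The precision, recall and F-measure at K quoted above are standard information-retrieval statistics. The sketch below shows one common way to compute them for a single user; the slides do not describe the evaluation protocol (how relevant items are held out, how scores are averaged across users), so the function name `precision_recall_f` and the toy data are assumptions.

```python
def precision_recall_f(recommended, relevant, k):
    """Precision@k, recall@k and their harmonic mean for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: 2 of the 5 recommendations are among the 4 relevant items.
p, r, f = precision_recall_f(["a", "b", "c", "d", "e"], {"a", "c", "x", "y"}, k=5)
print(p, r, f)
```

Precision typically falls and recall rises as K grows, which matches the reporting of precision at small K and recall at larger K.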
  16. Charts: Precision, Recall and F-measure against K (K = 5, 10, 20, 30, 50, 75) for UBCF(LL), UBCF(P), IBCF(LL), IBCF(P), HUNR and UTNR
  17. Results: Netflix 2M
       • Standard Item Based CF performs the worst, with precision values well below 1%
       • Standard User Based CF yields precision values around 10%, whereas the proposed HUNR performs best, with precision at 5 above 38%
       • Recall at 75 shows that HUNR retrieves around 24% of the relevant items, whereas standard User Based CF retrieves < 9% of them
       • F-measure analysis also indicates that HUNR performs much better than standard CF
  18. Charts: Precision, Recall and F-measure against K (K = 5, 10, 20, 30, 50, 75) for UBCF(LL), UBCF(P), IBCF(LL), HUNR and UTNR
  19. Conclusion
       • We proposed a novel hybrid recommender approach using LDA, which combines user similarity in a latent topic space with rating overlap based similarity to refine neighborhood formation and improve the quality of recommendations
       • Empirical evaluations indicate that the technique is well suited to recommender domains where contextual data in text form is available, describing the items being recommended
       • The proposed approach significantly outperforms standard CF algorithms, which use rating data alone to generate recommendations
  21. Thank You. www.flytxt.com
  22. Backup Slides
  23. LDA – As a General Graphical Model (diagram)
