Clustering Technique for Collaborative Filtering Recommendation and Application to Venue Recommendation

  1. Clustering Techniques for Collaborative Filtering and the Application to Venue Recommendation
     Manh Cuong Pham, Yiwei Cao, Ralf Klamma
     Information Systems and Database Technology, RWTH Aachen, Germany
     I-KNOW 2010, Graz, Austria, September 1, 2010
  2. Agenda
     • Introduction
     • Clustering techniques for collaborative filtering
     • Case study: venue recommendation
        • Data sets: DBLP and CiteSeerX
        • User-based
        • Item-based
     • Conclusions and Outlook
  3. Introduction
     • Recommender systems help users deal with information overload
     • Components of a recommender system [Burke 2002]
        • Set of users, set of items (products)
        • Implicit/explicit user ratings on items
        • Additional information: trust, collaboration, etc.
        • Algorithms for generating recommendations
     • Recommendation techniques [Adomavicius and Tuzhilin 2005]
        • Collaborative Filtering (CF) [Breese et al. 1998]
           • Memory-based algorithms: user-based, item-based [Sarwar 2001]
           • Model-based algorithms: Bayesian network [Breese 1998]; clustering [Ungar 1998]; rule-based [Sarwar 2000]; machine learning on graphs [Zhou 2005, 2008]; PLSA [Hofmann 1999]; matrix factorization [Koren 2009]
        • Content-based recommendation [Sarwar et al. 2001]
        • Hybrid approaches [Burke 2002]
  4. Clustering and Collaborative Filtering
     [Figure: user clustering and item clustering; labels: Cluster 1, Cluster 2, item-based CF]
     • Problems: large-scale data; sparse rating matrix; diversity of users and items
     • Previous approaches: clustering based on ratings
        • K-means, Metis, etc. [Rashid 2006, Xue 2005, O’Connor 2001]
     • Our approach
        • Clustering based on additional information: relationships between users, items
        • Improvement in both efficiency and accuracy
  5. Evaluation: Venue Recommendation
     • Recommend venues (conferences, journals, workshops) to researchers
     • User-based CF
        • Populate user-item matrix using venue participation history
        • Ratings: normalized venue publication counts
        • User clustering: co-authorship network
     • Item-based CF
        • Similarity between venues based on citation
        • Similarity measure: cosine
        • Venue clustering: similarity network
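
The user-based setup on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide only says ratings are "normalized" publication counts, so dividing by each author's total is an assumption, and the names `normalized_ratings`, `recommend`, and `cluster_of` as well as the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def normalized_ratings(pub_counts):
    """Turn per-author venue publication counts into ratings in (0, 1].

    Assumption: each count is divided by the author's total publication
    count; the slides do not specify the normalization.
    """
    ratings = {}
    for author, counts in pub_counts.items():
        total = sum(counts.values())
        ratings[author] = {v: c / total for v, c in counts.items()} if total else {}
    return ratings


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(a[v] * b[v] for v in set(a) & set(b))
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def recommend(author, ratings, cluster_of, top_n=3):
    """Score unseen venues using only neighbours from the author's own cluster."""
    scores = defaultdict(float)
    for other, other_ratings in ratings.items():
        if other == author or cluster_of[other] != cluster_of[author]:
            continue  # neighbours outside the cluster are never compared
        sim = cosine(ratings[author], other_ratings)
        for venue, rating in other_ratings.items():
            if venue not in ratings[author]:
                scores[venue] += sim * rating
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [venue for venue, score in ranked[:top_n] if score > 0]


# Hypothetical toy data: publication counts per author and venue.
pub_counts = {
    "alice": {"KDD": 3, "ICDM": 1},
    "bob": {"KDD": 2, "RecSys": 2},
    "carol": {"CHI": 4},
}
cluster_of = {"alice": 0, "bob": 0, "carol": 1}  # e.g. from co-authorship clustering
ratings = normalized_ratings(pub_counts)
print(recommend("alice", ratings, cluster_of))
```

Restricting the neighbour search to one cluster is where the efficiency gain would come from: similarities are computed only within a cluster rather than against every user.
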
  6. Data Sets
     • DBLP
        • 788,259 author names
        • 1,226,412 publications
        • 3,490 venues (conferences, workshops, journals)
     • CiteSeerX
        • 7,385,652 publications (including publications in reference lists)
        • 22,735,240 citations
        • Over 4 million author names
     • Combination
        • Canopy clustering [McCallum 2000]
        • Result: 864,097 matched pairs
        • On average, venues cite 2,306 times and are cited 2,037 times
  7. User-based CF: Author Clustering
     • Data: DBLP
     • Two test cases, for the years 2005 and 2006
        • Clustering of co-authorship networks
        • 2005 network: 478,108 nodes; 1,427,196 edges
        • 2006 network: 544,601 nodes; 1,686,867 edges
        • Prediction of venue participation
     • Clustering algorithm
        • Density-based algorithm [Clauset 2004]
        • Obtained modularity: 0.829 and 0.82
     • Cluster size distribution follows a power law
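
The modularity values quoted on this slide measure how well a partition separates a network into communities. The sketch below computes the standard Newman-Girvan modularity for a given partition of an undirected, unweighted graph; it does not reproduce the clustering algorithm itself, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q = sum over communities c of (L_c / m - (d_c / 2m)^2).

    L_c: number of edges inside community c
    d_c: sum of node degrees in community c
    m:   total number of edges
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy co-authorship graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Values close to 1 indicate strong community structure, which is the sense in which scores above 0.8 are notable for the co-authorship networks.
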
  8. User-based CF: Performance
     • Precision for 1,000 randomly chosen authors
     • Precision computed at 11 standard recall levels: 0%, 10%, ..., 100%
     • Results
        • Clustering performs better
        • Improvement is not significant
        • Better efficiency
     • Further improvement
        • Different networks: citation
        • Overlapping clustering
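
Precision at the 11 standard recall levels can be computed per author as sketched below. The slide does not say how precision is defined between observed recall points, so the usual information-retrieval interpolation (the maximum precision at any recall at or above the level) is assumed here; the function name and example data are hypothetical.

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked:   recommended venues, best first
    relevant: venues the author actually participated in
    """
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) measured at each relevant hit
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Small tolerance guards against floating-point rounding in recall values.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


curve = eleven_point_precision(["A", "B", "C", "D"], {"A", "C"})
print(curve)
```

Averaging these 11-value curves over all sampled authors gives one precision-recall curve per method, which is how a clustered and an unclustered recommender can be compared.
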
  9. Item-based CF: Venue Network Creation and Clustering
     • Knowledge network
        • Aggregate bibliographic coupling counts at venue level
        • Undirected graph G(V, E), where V: venues, E: edges weighted by cosine similarity
        • Threshold:
        • Clustering: density-based algorithm [Newman 2004, Clauset 2004]
        • Network visualization: force-directed paradigm [Fruchterman 1991]
     • Knowledge flow network (for venue ranking, see Pham & Klamma 2010)
        • Aggregate bibliographic coupling counts at venue level
        • Threshold: citation counts >= 50
        • Domains from Microsoft Academic Search
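
One way to read the knowledge network construction is sketched below: each venue gets a profile counting how often its papers reference each cited paper, venues are compared by cosine similarity of those profiles, and an edge is kept when the similarity reaches a threshold. The aggregation details and the threshold value are not given in the transcript, so the function names, the `threshold` parameter, and the toy data are all assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def venue_reference_profiles(citations, venue_of):
    """Count, for each citing venue, how often each paper is referenced."""
    profiles = {}
    for citing, cited in citations:
        profiles.setdefault(venue_of[citing], Counter())[cited] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two reference-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def knowledge_network(profiles, threshold):
    """Undirected weighted edges between venues with similarity >= threshold."""
    edges = {}
    for u, v in combinations(sorted(profiles), 2):
        weight = cosine(profiles[u], profiles[v])
        if weight >= threshold:
            edges[(u, v)] = weight
    return edges


# Hypothetical toy data: (citing paper, cited paper) pairs and paper venues.
venue_of = {"p1": "KDD", "p2": "ICDM", "p3": "CHI"}
citations = [("p1", "r1"), ("p1", "r2"), ("p2", "r1"), ("p2", "r2"), ("p3", "r9")]
profiles = venue_reference_profiles(citations, venue_of)
network = knowledge_network(profiles, threshold=0.5)  # 0.5 is an arbitrary choice
print(network)
```

The resulting edge dictionary is the kind of weighted graph that a community detection step would then cluster.
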
  10. Knowledge Network: the Visualization
  11. Knowledge Network: Clustering
  12. Interdisciplinary Venues: Top Betweenness Centrality
  13. High Prestige Series: Top PageRank
  14. Conclusions and Future Research
     • Clustering and recommender systems
        • Advantage of using additional information for clustering
        • Application of clustering to both user-based and item-based CF
        • Key issue: impact of communities (clusters) on the quality of recommendations; non-overlapping vs. overlapping communities
     • Outlook
        • Further evaluation: trust network clustering, paper and potential collaborator recommendation
        • Datasets: Epinion, etc.
        • Digital libraries in Web 2.0: Mendeley, ResearchGate, etc.

Editor's Notes

  • Pham Manh Cuong