Clustering Technique for Collaborative Filtering Recommendation and Application to Venue Recommendation


Published in: Education, Technology
  • Pham Manh Cuong

    1. Clustering Techniques for Collaborative Filtering and the Application to Venue Recommendation. Manh Cuong Pham, Yiwei Cao, Ralf Klamma. Information Systems and Database Technology, RWTH Aachen, Germany. I-KNOW 2010, Graz, Austria, September 01, 2010
    2. Agenda <ul><li>Introduction </li></ul><ul><li>Clustering techniques for collaborative filtering </li></ul><ul><li>Case study: venue recommendation </li></ul><ul><ul><li>Data sets: DBLP and CiteSeerX </li></ul></ul><ul><ul><li>User-based </li></ul></ul><ul><ul><li>Item-based </li></ul></ul><ul><li>Conclusions and Outlook </li></ul>
    3. Introduction <ul><li>Recommender systems: help users deal with information overload </li></ul><ul><li>Components of a recommender system [Burke 2002] </li></ul><ul><ul><li>Set of users, set of items (products) </li></ul></ul><ul><ul><li>Implicit/explicit user ratings on items </li></ul></ul><ul><ul><li>Additional information: trust, collaboration, etc. </li></ul></ul><ul><ul><li>Algorithms for generating recommendations </li></ul></ul><ul><li>Recommendation techniques [Adomavicius and Tuzhilin 2005] </li></ul><ul><ul><li>Collaborative Filtering (CF) [Breese et al. 1998] </li></ul></ul><ul><ul><ul><li>Memory-based algorithms: user-based, item-based [Sarwar 2001] </li></ul></ul></ul><ul><ul><ul><li>Model-based algorithms: Bayesian network [Breese 1998]; Clustering [Ungar 1998]; Rule-based [Sarwar 2000]; Machine learning on graphs [Zhou 2005, 2008]; PLSA [Hofmann 1999]; Matrix factorization [Koren 2009] </li></ul></ul></ul><ul><ul><li>Content-based recommendation [Sarwar et al. 2001] </li></ul></ul><ul><ul><li>Hybrid approaches [Burke 2002] </li></ul></ul>
    4. Clustering and Collaborative Filtering [Figure: diagram labeled "User clustering", "Item clustering", "Cluster 1", "Cluster 2", and "item-based CF"] <ul><li>Problems: large-scale data; sparse rating matrix; diversity of users and items </li></ul><ul><li>Previous approaches: Clustering based on ratings </li></ul><ul><ul><li>K-means, Metis, etc. [Rashid 2006, Xue 2005, O’Connor 2001] </li></ul></ul><ul><li>Our approach </li></ul><ul><ul><li>Clustering based on additional information: relationships between users, items </li></ul></ul><ul><ul><li>Improvement on both efficiency and accuracy </li></ul></ul>
    5. Evaluation: Venue Recommendation <ul><li>Recommend venues (conferences, journals, workshops) to researchers </li></ul><ul><li>User-based CF </li></ul><ul><ul><li>Populate user-item matrix using venue participation history </li></ul></ul><ul><ul><li>Ratings: normalized venue publication counts </li></ul></ul><ul><ul><li>User-clustering: co-authorship network </li></ul></ul><ul><li>Item-based CF </li></ul><ul><ul><li>Similarity between venues based on citation </li></ul></ul><ul><ul><li>Similarity measure: cosine </li></ul></ul><ul><ul><li>Venue clustering: similarity network </li></ul></ul>
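The user-based setup on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the normalization (dividing each author's counts by the author's total), the scoring rule, the function names, and the toy authors and venues are all assumptions. The cluster labels stand in for the output of co-authorship network clustering.

```python
from math import sqrt

def normalize_counts(pub_counts):
    """Turn raw per-venue publication counts into ratings that sum to 1 per author.
    (Assumed normalization; the slides do not specify the scheme.)"""
    ratings = {}
    for author, counts in pub_counts.items():
        total = sum(counts.values())
        ratings[author] = {v: c / total for v, c in counts.items()} if total else {}
    return ratings

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(author, ratings, cluster_of, top_n=3):
    """Score unseen venues using only neighbours from the author's own cluster."""
    own = ratings[author]
    scores = {}
    for other, vec in ratings.items():
        if other == author or cluster_of[other] != cluster_of[author]:
            continue  # neighbours outside the cluster are ignored
        sim = cosine(own, vec)
        if sim <= 0:
            continue
        for venue, r in vec.items():
            if venue not in own:
                scores[venue] = scores.get(venue, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical toy data: authors, venues, and cluster labels are made up.
pub_counts = {
    "alice": {"KDD": 3, "ICDM": 1},
    "bob": {"KDD": 2, "RecSys": 2},
    "carol": {"CHI": 4, "UIST": 1},
}
cluster_of = {"alice": 0, "bob": 0, "carol": 1}
ratings = normalize_counts(pub_counts)
print(recommend("alice", ratings, cluster_of))
```

Restricting the neighbour search to one cluster is what makes the approach cheaper than plain user-based CF: similarity is computed against cluster members only rather than against every author.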
    6. Data Sets <ul><li>DBLP ( </li></ul><ul><ul><li>788,259 author names </li></ul></ul><ul><ul><li>1,226,412 publications </li></ul></ul><ul><ul><li>3,490 venues (conferences, workshops, journals) </li></ul></ul><ul><li>CiteSeerX ( </li></ul><ul><ul><li>7,385,652 publications (including publications in reference lists) </li></ul></ul><ul><ul><li>22,735,240 citations </li></ul></ul><ul><ul><li>Over 4 million author names </li></ul></ul><ul><li>Combination </li></ul><ul><ul><li>Canopy clustering [McCallum 2000] </li></ul></ul><ul><ul><li>Result: 864,097 matched pairs </li></ul></ul><ul><ul><li>On average, a venue cites 2,306 times and is cited 2,037 times </li></ul></ul>
    7. User-based CF: Author Clustering <ul><li>Data: DBLP </li></ul><ul><li>Two test cases, for the years 2005 and 2006 </li></ul><ul><ul><li>Clustering of co-authorship networks </li></ul></ul><ul><ul><li>2005 network: 478,108 nodes; 1,427,196 edges </li></ul></ul><ul><ul><li>2006 network: 544,601 nodes; 1,686,867 edges </li></ul></ul><ul><ul><li>Prediction of venue participation </li></ul></ul><ul><li>Clustering algorithm </li></ul><ul><ul><li>Density-based algorithm [Clauset 2004] </li></ul></ul><ul><ul><li>Obtained modularity: 0.829 and 0.82 </li></ul></ul><ul><li>Cluster size distribution follows a power law </li></ul>
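The modularity values reported on this slide measure how much denser the edges inside clusters are than expected by chance. A minimal sketch of that score for an unweighted, undirected graph is shown below; the function name and the toy graph are illustrative assumptions, and the code only evaluates a given partition, it does not perform the clustering itself.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2m))^2, where m is the
    number of edges, L_c the number of edges inside c, and d_c the total
    degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy co-authorship graph (hypothetical): two triangles joined by one edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Putting every node in a single community always yields Q = 0, so values above 0.8, as reported on the slide, indicate a strongly clustered network.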
    8. User-based CF: Performance <ul><li>Precision for 1,000 randomly chosen authors </li></ul><ul><li>Precision computed at the 11 standard recall levels 0%, 10%, ..., 100% </li></ul><ul><li>Results </li></ul><ul><ul><li>Clustering performs better </li></ul></ul><ul><ul><li>The improvement is not significant </li></ul></ul><ul><ul><li>Better efficiency </li></ul></ul><ul><li>Further improvement </li></ul><ul><ul><li>Different networks: citation </li></ul></ul><ul><ul><li>Overlapping clustering </li></ul></ul>
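Precision at the 11 standard recall levels is commonly computed with interpolation: the precision at recall level r is the highest precision observed at any recall of r or more. The sketch below assumes that convention, which the slides do not state explicitly; the function name and the toy ranking are illustrative.

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked:   recommended venues, best first
    relevant: venues the author actually participated in (ground truth)
    """
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) measured at each hit in the ranking
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision among all points with recall >= level.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]

# Hypothetical ranking for one author: hits at ranks 1 and 3.
curve = eleven_point_precision(["A", "B", "C", "D"], {"A", "C"})
print(curve)
```

In this example the first six levels (0% to 50% recall) get precision 1.0 and the remaining five get 2/3. Averaging such curves over the sampled authors gives one precision-recall curve per method.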
    9. Item-based CF: Venue Network Creation and Clustering <ul><li>Knowledge network </li></ul><ul><ul><li>Aggregate bibliographic coupling counts at venue level </li></ul></ul><ul><ul><li>Undirected graph G(V, E), where V: venues, E: edges weighted by cosine similarity </li></ul></ul><ul><ul><li>Threshold: </li></ul></ul><ul><ul><li>Clustering: density-based algorithm [Newman 2004, Clauset 2004] </li></ul></ul><ul><ul><li>Network visualization: force-directed paradigm [Fruchterman 1991] </li></ul></ul><ul><li>Knowledge flow network (for venue ranking, see Pham & Klamma 2010) </li></ul><ul><ul><li>Aggregate bibliographic coupling counts at venue level </li></ul></ul><ul><ul><li>Threshold: citation counts >= 50 </li></ul></ul><ul><ul><li>Domains from Microsoft Academic Search ( </li></ul></ul>
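Building the knowledge network can be sketched as follows: each venue is represented by a vector of how often it cites each target, venue pairs are compared by cosine similarity, and an edge is kept only when the similarity reaches a threshold. The slide does not give the threshold value, so the 0.5 used here is purely a placeholder; the function names and the toy citation counts are also assumptions.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_venue_network(citation_counts, threshold):
    """Return weighted undirected edges {(u, v): similarity} with similarity >= threshold.

    citation_counts[venue] maps each cited target to the number of times
    the venue's papers cite it (venue-level bibliographic coupling data).
    """
    edges = {}
    for u, v in combinations(sorted(citation_counts), 2):
        sim = cosine(citation_counts[u], citation_counts[v])
        if sim >= threshold:
            edges[(u, v)] = sim
    return edges

# Hypothetical toy data: three venues and the targets they cite.
citation_counts = {
    "V1": {"p1": 2, "p2": 1},
    "V2": {"p1": 1, "p2": 2},
    "V3": {"p3": 5},
}
network = build_venue_network(citation_counts, threshold=0.5)  # placeholder threshold
print(network)
```

Here V1 and V2 share cited targets and end up connected with weight 0.8, while V3 stays isolated. The resulting weighted graph is the kind of input a community detection algorithm would then cluster.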
    10. Knowledge Network: the Visualization
    11. Knowledge Network: Clustering
    12. Interdisciplinary Venues: Top Betweenness Centrality
    13. High Prestige Series: Top PageRank
    14. Conclusions and Future Research <ul><li>Clustering and recommender systems </li></ul><ul><ul><li>Advantage of using additional information for clustering </li></ul></ul><ul><ul><li>Application of clustering to both user-based and item-based CF </li></ul></ul><ul><ul><li>Key issue: impact of the communities (clusters) on the quality of recommendations; non-overlapping vs. overlapping communities </li></ul></ul><ul><li>Outlook </li></ul><ul><ul><li>Further evaluation: trust network clustering, paper and potential collaborator recommendation </li></ul></ul><ul><ul><li>Datasets: Epinion,, etc. </li></ul></ul><ul><ul><li>Digital libraries in Web 2.0: Mendeley, ResearchGate, etc. </li></ul></ul>