Synthese Recommender System


Published in: Technology, Business
  • Purpose of the talk: to describe an approach to the problem of recommender systems for a digital library (DL). The system is a work in progress, developed in collaboration with David Zeber, Ph.D. student in Statistics at Cornell.

    1. A Hybrid, Multi-Dimensional Recommender for Journal Articles in a Scientific Digital Library
        Andre Vellino [email_address], Canada Institute for Scientific and Technical Information, National Research Council
        David Zeber [email_address], Dept. of Statistics, Cornell University
        WPRS 07, 2 November 2007
    2. Outline of Talk
        • Introduction and Motivation
        • Problems for Article Recommender Systems
        • Proposed Solutions:
            • Hybrid of Collaborative Filtering (CF) + Content Based Filtering (CBF)
            • PageRanked Citations
            • Multi-Dimensional based on IR Modes
            • Explanation-Based Interface
        • Future Work
    3. Introduction and Motivation
        • CISTI: who we are
            • Government Science Library for Canada
            • Publisher of 16 science journals
            • Digital Collection
                • 3667 Journals
                • 6,400,000 articles
        • Motivation for this project
            • To enhance the process of scientific innovation by providing high-quality, serendipitous article recommendations
    4. Typical Issues with CF Recommenders
        • Data Sparsity
            • Ratio of Users / Items is low (~ 1:10)
            • Number of Ratings per User is low
            • Ratings matrix sparsity ~ 95%
        • Cold Start Problem
            • First-time users get poor or no recommendations because the CF matrix has no entries for them
        • Rating Items
            • CF recommender must be trained (explicitly or implicitly) by providing ratings to items
        • Principle of Induction
            • People who exhibited similar behaviour in the past will tend to exhibit similar behaviour in the future.
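To make the sparsity figure concrete, here is a minimal sketch that computes sparsity as the fraction of empty user-item cells. The `sparsity` helper and the toy data are hypothetical and only mirror the 1:10 user-to-item ratio mentioned on the slide.

```python
def sparsity(ratings, n_users, n_items):
    """Return the fraction of user-item cells that hold no rating."""
    filled = sum(len(items) for items in ratings.values())
    return 1.0 - filled / (n_users * n_items)


# Hypothetical toy data: 2 users, 20 items (users:items = 1:10), 2 ratings in total.
ratings = {"u1": {"p1": 1.0}, "u2": {"p7": 1.0}}
value = sparsity(ratings, n_users=2, n_items=20)
print(f"sparsity = {value:.0%}")  # 2 filled cells out of 40 -> 95%
```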
    5. Specific Issues for Science Digital Libraries
        • Data Sparsity
            • More Articles & Fewer Users (10x)
            • Fewer ratings per item (~ 99% sparsity)
        • Rating Articles
            • Explicit ratings are more difficult to obtain
                • DL users have less need to “express themselves” by explicitly rating items than movie watchers
            • Implicit ratings depend on UI features of the DL
                • No reliable method for inferring ratings from browsing and query behaviour
        • Principle of Induction is not necessarily true in the DL context
            • Interest drift
            • Context shifts
    6. General Research Strategy
        • Follow in footsteps of TechLens+
            • “Fusion Mixed Hybrid”: CF + CBF
            • Seed CF recommender with citation matrix
            • Incorporate explanation feature in results interface
        • With Extensions
            • PageRank on Citations
            • User IR modes and projects
            • Implicit user ratings derived from clickstream + project + mode
            • Distributed Multi-Dimensional Recommender
            • Explanation-based interface
    7. Recommender Citation Seeding
        • Articles either cite or don’t cite other articles
        • Some articles that are cited are not in the collection
        • Users’ “article collection profile” plays the role of citations
        TechLens approach to the Cold Start / Data Sparsity problem
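A minimal sketch of citation seeding, assuming each citing article is treated as a pseudo-user that "rates" the articles it cites with 1.0, and that citations pointing outside the collection are dropped. The function name, row prefixes, and toy data are hypothetical, not taken from the talk.

```python
def seed_ratings(citations, collection, user_profiles):
    """Build a {row: {article: rating}} matrix from citations and user collections."""
    matrix = {}
    for citing, cited_list in citations.items():
        # Each citing article acts as a pseudo-user; cited articles outside
        # the collection cannot be recommended, so they are skipped.
        matrix[f"article:{citing}"] = {c: 1.0 for c in cited_list if c in collection}
    for user, articles in user_profiles.items():
        # A real user's article collection is entered the same way as a citation list.
        matrix[f"user:{user}"] = {a: 1.0 for a in articles if a in collection}
    return matrix


collection = {"p1", "p2", "p3"}
citations = {"p1": ["p2", "p3", "x9"], "p2": ["p3"]}  # "x9" is not in the collection
user_profiles = {"u1": ["p1", "p3"]}
seeded = seed_ratings(citations, collection, user_profiles)
```

Because the matrix already contains one row per citing article, a first-time user with a small collection can still be matched against neighbours.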
    8. Apply PageRank to Citation Matrix
        • PageRank algorithm applied to citations
        • d: damping factor = 0.85
        • PR(p): PageRank score of article p
        • B(p): articles that cite p
        • N_p: number of citations for article p
        Aurel Constantinescu, “Ranking Full-Text Articles using Citation Based Methods”, Master’s Thesis, University of Ottawa
        [Figure: example citation graph labelled with the values 47.5, 87.5 and 135]
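The formula on this slide did not survive in the transcript. The sketch below uses the commonly stated form PR(p) = (1 - d) + d * sum over q in B(p) of PR(q) / N_q, assuming N_q counts the outgoing citations of article q. The variant used in the talk may differ, for example in how the (1 - d) term is normalised. The function name and toy graph are hypothetical.

```python
def pagerank(cites, d=0.85, iterations=100):
    """Iterate PageRank on a citation graph.

    cites maps each article to the list of articles it cites.
    Articles that cite nothing pass on no score in this simple sketch.
    """
    nodes = set(cites) | {p for cited in cites.values() for p in cited}
    # B(p): the articles that cite p.
    cited_by = {n: [] for n in nodes}
    for q, cited in cites.items():
        for p in cited:
            cited_by[p].append(q)
    pr = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[q] / len(cites[q]) for q in cited_by[p])
            for p in nodes
        }
    return pr


# Toy graph: a cites b and c, b cites c, c cites a.
scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy graph, c is cited by both other articles and ends up ranked first.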
    9. PageRank-weighted Citation matrix
        • Apply PageRank on Citations
            • Use citation data (as in TechLens+)
            • Apply PageRank to weight the citation-based “ratings”
        • Done before but only at the Journal level ( http:// / )
        [Figure: PageRank-weighted citation matrix relating articles (p1 to p8), their citations, and users (u1, u2); entries range from 0.2 to 0.7]
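One way to weight citation-based "ratings" is sketched below. It assumes each binary rating is replaced by the cited article's PageRank score divided by the highest score, so that weights fall in (0, 1]. The talk does not specify the weighting scheme, so both the normalisation and the names here are hypothetical.

```python
def weight_by_pagerank(matrix, scores):
    """Replace each rating with the cited article's normalised PageRank score."""
    top = max(scores.values())
    return {
        row: {article: scores.get(article, 0.0) / top for article in rated}
        for row, rated in matrix.items()
    }


# Hypothetical inputs: a seeded binary matrix and made-up PageRank scores.
binary = {"article:p1": {"p2": 1.0, "p3": 1.0}, "user:u1": {"p3": 1.0}}
pr_scores = {"p1": 1.0, "p2": 0.5, "p3": 2.0}
weighted = weight_by_pagerank(binary, pr_scores)
```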
    10. User Project Profiles & IR Modes
        • Project Profiles
        • Explicit User-defined Projects
            • Subject-matter expertise (Novice / Knowledgeable / Expert)
        • Defined by a document collection that characterizes the project:
            • By content: the feature vectors (bag of words) from that collection
            • By CF similarity from “citations” list for the user
        • IR Modes
        • Users of DLs have a broad range of IR goals, such as
            • seeking answers to highly specific scientific questions
            • developing literature surveys
            • establishing prior art for patent claims
        • “innovation” / “information” / “authority”
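A minimal sketch of a content-based project profile, assuming a plain bag-of-words vector built by whitespace tokenisation and compared with cosine similarity. A real system would likely add stop-word removal and term weighting. All names and texts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Count lower-cased, whitespace-separated tokens over a document collection."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical project collection and two candidate articles.
project = bag_of_words(["citation ranking", "citation analysis"])
close = bag_of_words(["citation ranking methods"])
far = bag_of_words(["protein folding"])
sim_close = cosine(project, close)
sim_far = cosine(project, far)
```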
    11. Implicit Preferences Generation In Context
        [Figure: diagram with the labels Search Terms, Full Text, Author, Keyphrase, Journal, Abstract, Project, IR Mode, Clickstream, User State]
    12. Multi-Dimensional Ratings Matrix
        [Figure: ratings matrix with users (Tom, Alice, Bob, Carol), articles (p1 to p6) and IR modes (Innovation, Information, Authority); entries range from 0.2 to 0.7]
        G. Adomavicius, R. Sankaranarayanan, S. Sen, A. Tuzhilin, “Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach”, ACM Transactions on Information Systems, 2005
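A multi-dimensional ratings matrix can be sketched as a mapping keyed by (user, article, IR mode), from which an ordinary two-dimensional user-by-article matrix is sliced for the mode the user is currently in. The rating values below are made up for illustration; the transcript does not preserve which value belongs to which cell.

```python
def slice_by_mode(ratings, mode):
    """Return the 2-D {user: {article: rating}} slice for one IR mode."""
    out = {}
    for (user, article, rated_mode), value in ratings.items():
        if rated_mode == mode:
            out.setdefault(user, {})[article] = value
    return out


# Hypothetical ratings keyed by (user, article, IR mode).
ratings_3d = {
    ("Alice", "p1", "innovation"): 0.3,
    ("Alice", "p1", "authority"): 0.7,
    ("Bob", "p2", "innovation"): 0.6,
    ("Carol", "p4", "information"): 0.4,
}
innovation = slice_by_mode(ratings_3d, "innovation")
```

Any standard CF method can then run on the slice, so the same article may be rated differently for the same user depending on the mode.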
    13. Scaling Strategy: Distributed Recommenders
        • Multiple ratings matrices decomposed by subject area
        • Merge separate recommendations by subject
        • Reduces matrix sparsity
        • Improves accuracy of recommendations
        S. Berkovsky, T. Kuflik, and F. Ricci, “Distributed Collaborative Filtering with Domain Specialization”, Proceedings of RecSys 2007
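The decompose-and-merge idea can be sketched as follows, assuming each article belongs to exactly one subject area and that per-subject recommendations are merged simply by predicted score. The merge rule, function names, and data are hypothetical and are not taken from the cited paper.

```python
from collections import defaultdict


def split_by_subject(ratings, subject_of):
    """Partition {(user, article): rating} into one ratings matrix per subject."""
    parts = defaultdict(dict)
    for (user, article), value in ratings.items():
        parts[subject_of[article]][(user, article)] = value
    return dict(parts)


def merge_recommendations(per_subject, top_n=3):
    """Merge per-subject {article: score} results into one list, best score first."""
    merged = [
        (score, article)
        for recs in per_subject.values()
        for article, score in recs.items()
    ]
    merged.sort(reverse=True)
    return [article for _, article in merged[:top_n]]


subject_of = {"p1": "physics", "p2": "physics", "p7": "biology"}
ratings = {("u1", "p1"): 1.0, ("u1", "p7"): 1.0, ("u2", "p2"): 1.0}
parts = split_by_subject(ratings, subject_of)

# Hypothetical output of two independent per-subject recommenders.
per_subject = {"physics": {"p1": 0.9, "p2": 0.4}, "biology": {"p7": 0.7}}
top = merge_recommendations(per_subject, top_n=2)
```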
    14. UI for Navigating Recommendations
        • Inspiration for “recommendation refinement” UI
            • Carrot2 Cluster maps
        • Explanation-based Recommendations
            • Provide transparency → increase user trust
            • Take advantage of explanation to:
                • Allow users to cluster by type of cause
                • Filter out recommendations
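If every recommendation carries the cause that produced it, clustering and filtering by explanation become simple operations. The sketch below assumes a single cause label per recommendation; the labels ("cited-by-collection", "similar-content", "similar-users") and function names are hypothetical.

```python
from collections import defaultdict


def cluster_by_cause(recommendations):
    """Group recommended articles by the type of cause that explains them."""
    clusters = defaultdict(list)
    for rec in recommendations:
        clusters[rec["cause"]].append(rec["article"])
    return dict(clusters)


def filter_out(recommendations, unwanted_causes):
    """Drop recommendations whose explanation the user has rejected."""
    return [rec for rec in recommendations if rec["cause"] not in unwanted_causes]


recs = [
    {"article": "p1", "cause": "cited-by-collection"},
    {"article": "p2", "cause": "similar-content"},
    {"article": "p3", "cause": "cited-by-collection"},
    {"article": "p4", "cause": "similar-users"},
]
clusters = cluster_by_cause(recs)
kept = filter_out(recs, {"similar-users"})
```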
    15. Carrot2 Cluster maps
        [Figure: UI mock-up with the labels 2D projection of Recommended Item-User Similarity, Explanation Clusters, Dimensionality weighting slider]
    16. Future Work
        • V.1 of Synthèse
            • PageRanked Citations
            • Mixed Hybrid Recommender
        • Study effect of PageRank on Recommendation Rankings
        • Build v.2 of Synthèse
            • Modal user profiles
            • Distributed Multi-Dimensional Recommender
        • Study effect of Multi-Dimensionality
        • Explore distributed recommender strategy
        • Study impact of additional information
            • Author Hirsch Index / “Faculty of 1000” Article Ratings
        • Refine content-based filtering with domain-specific semantics
        • Incorporate privacy protection measures to ensure anonymity
    17. THANK YOU! Questions?