Usage-Based vs. Citation-Based Recommenders in a Digital Library

There are two principal data sources for collaborative-filtering recommenders in scholarly digital libraries: usage data harvested from a large, distributed collection of OpenURL web logs, and citation data extracted from the journal articles themselves.

This study explores the characteristics of the recommendations generated by implementations of these two methods: the bX system by Ex Libris and an experimental citation-based recommender, Sarkanto. Recommendations from each system were compared according to their semantic similarity to the seed article used to generate them. Since the full text of the articles was not available for all the recommendations in both systems, the semantic similarity between a seed article and a recommended article was approximated by the semantic distance between the journals in which the articles were published. The semantic distance between journals was computed from the "semantic vectors" distance between all the terms in the full text of the available articles in each journal. The study shows that citation-based recommendations are more semantically diverse than usage-based ones.

These recommenders are complementary: most of the time, when one recommender produces recommendations for a seed article, the other does not.


Usage-Based vs. Citation-Based Recommenders in a Digital Library

  1. Usage-Based vs. Citation-Based Recommenders in a Digital Library
     André Vellino, School of Information Studies, University of Ottawa
     Blog: http://synthese.wordpress.com | Twitter: @vellino | E-mail: avellino@uottawa.ca
  2. Context
     - Canada Institute for Scientific and Technical Information (aka Canada’s National Science Library)
     - Has a full-text digital collection (Scientific, Technical, Medical) with text-mining rights for research purposes only
       - Elsevier and Springer (mostly)
       - ~8M articles
       - ~2,800 journals
       - ~3 TB
     - Plan: a hybrid, multi-dimensional recommender
       - Usage-based (CF)
       - Content-based (CBF)
       - User context
  3. Sparsity of Usage Data is a Problem in Digital Libraries
     [Figure: side-by-side user-item graphs contrasting scale: Amazon (~70 M users, ~93 M items) vs. digital libraries (~70,000 users, ~7 M items)]
  4. Data is Sparse Too
     - Sparseness of a dataset: S = (number of edges in the user-item graph) / (total number of possible edges); a toy computation is sketched after the transcript
     - Mendeley data: S = 2.66 x 10^-5
     - Netflix: S = 1.18 x 10^-2
     - But also, the Mendeley data isn’t “highly connected”:
       - 83.6% of Mendeley articles were referenced by only 1 user
       - 6% of the articles were referenced by 3 or more users
  5. (2009) Ex Libris bX solution to data sparsity:
     Harvest lots of usage (co-download) behaviour from world-wide SFX (the Ex Libris OpenURL resolver) logs and apply collaborative filtering to correlate articles (a co-download sketch follows the transcript).
     Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. (in JCDL 2006)
  6. TechLens+ Citation-Based Recommendation
     [Figure: bipartite graph linking articles (p2, p3, p5) to their references]
     R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital Libraries with TechLens+. (in JCDL 2004)
  7. Do “Rated” Citations w/ PageRank Help?
     [Figure: article-article citation matrix for p1..p8 with PageRank-weighted entries, shown alongside user rows u1, u2 and compared against a constant weighting]
     Answer: using PageRank to “rate” citations is not significantly better than using a constant (0/1) weight (a toy comparison is sketched after the transcript).
     Note: there is ongoing work with NRC on a machine-learning method for extracting the “most important references”, which might help more.
  8. Sarkanto (NRC Article Recommender)
     - Uses the TechLens+ strategy of replacing the User-Item matrix with an Article-Article matrix built from citation data (an item-item sketch follows the transcript)
     - Uses the TASTE recommender (now the recommendation component of Mahout)
     - Is now decoupled from the user-based recommender
     - Compare side by side w/ ‘bX’ recommendations
     Try it here: http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
  9. Sarkanto compared w/ bX
     Sarkanto: “These are articles whose co-citations are similar to this one.”
     bX: “Users who viewed this article also viewed these articles.”
  10. Experiments
      - Sarkanto generated ~1.9 million citation-based recommendations (statically)
      - Experimental comparison done on 1,886 randomly selected articles from a subset of ~1.2M articles (down from ~8M)
      - Questions asked in the experiment:
        - How many recommendations are produced by each recommender?
        - Coverage: how often does a seed article generate a recommendation?
        - How semantically diverse are the recommendations?
  11. Measuring Semantic Diversity
      - Question: what is the semantic distance between the source article and the recommendations?
      - In this setup it was not possible to compare semantic distance directly over full text, because full text was not available for both sets of recommendations
      - Full text is available for the Sarkanto recommendations but not for the bX recommendations
  12. Journal-Journal Semantic Distance
      - Concatenate the full text of all the articles in each journal
      - From a Lucene index of the full text in each journal, use Dominic Widdows’ Semantic Vectors package to create:
        - a term-journal matrix
        - reduced-dimensionality term vectors (512) for each journal, using random projections
      - Apply multidimensional scaling (MDS) in R to obtain a 2-D distance matrix (2300 x 2300)
      (A compressed sketch of this pipeline follows the transcript.)
      G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for search visualization in a large scale article digital library. (in Second Workshop on Very Large Digital Libraries, ECDL 2009)
  13. 2-D Journal Distance Map
      [Figure: 2-D map of journals; coloured clusters represent journal subject headings (from publisher metadata)]
      http://cuvier.cisti.nrc.ca/~gnewton/torngat/applet.2009.07.22/index.html
  14. Results: Diversity of Recommendations
      - ~13% of seed articles generated recommendations from both bX and Sarkanto (i.e. not much overlap!)
      - Citation-based recommendations appear to be more semantically diverse than user-based ones
  15. Conclusions
      - Citation-based and user-based recommendations are complementary
      - Different kinds of data sources (users vs. citations) produce different kinds of (non-overlapping) results
      - Citation-based recommendations are more semantically diverse
        - Hypothesis: “user-based recommendations may be biased by the semantic similarity of search-engine results”
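A minimal sketch of the sparseness measure from slide 4: S is the number of observed user-item edges divided by the number of possible edges. The interaction data below is made up purely for illustration.

```python
# Sparseness S of a user-item dataset: observed edges / possible edges.
# Toy data for illustration only; the Mendeley and Netflix figures on
# slide 4 come from the actual datasets.

interactions = {
    ("u1", "article-a"), ("u1", "article-b"),
    ("u2", "article-a"), ("u3", "article-c"),
}

users = {u for u, _ in interactions}
items = {i for _, i in interactions}

S = len(interactions) / (len(users) * len(items))
print(f"S = {S:.4f}")  # 4 edges / (3 users x 3 items) = 0.4444
```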
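A rough sketch of the co-download idea behind bX (slide 5), assuming the harvested SFX/OpenURL logs have already been reduced to (session, article) pairs; the log shape, the session notion, and the plain co-occurrence count are illustrative assumptions, not Ex Libris's actual algorithm.

```python
from collections import defaultdict
from itertools import combinations

# (session_id, article_id) pairs, e.g. distilled from OpenURL resolver logs.
# Both the identifiers and the sessionization are assumptions for illustration.
events = [
    ("s1", "doi:10.1000/a"), ("s1", "doi:10.1000/b"),
    ("s2", "doi:10.1000/a"), ("s2", "doi:10.1000/b"), ("s2", "doi:10.1000/c"),
    ("s3", "doi:10.1000/c"),
]

sessions = defaultdict(set)
for session, article in events:
    sessions[session].add(article)

# Count how often two articles are downloaded within the same session.
co_downloads = defaultdict(int)
for articles in sessions.values():
    for a, b in combinations(sorted(articles), 2):
        co_downloads[(a, b)] += 1

def recommend(seed, k=5):
    """Rank articles by how often they co-occur with the seed article."""
    scores = defaultdict(int)
    for (a, b), n in co_downloads.items():
        if a == seed:
            scores[b] += n
        elif b == seed:
            scores[a] += n
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(recommend("doi:10.1000/a"))  # [('doi:10.1000/b', 2), ('doi:10.1000/c', 1)]
```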
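A toy version of the slide 7 comparison: fill the article-article citation matrix either with a constant 0/1 weight or with a PageRank score for the cited article. The citation graph is invented, and networkx's PageRank stands in for whatever rating scheme was actually evaluated.

```python
import networkx as nx  # assumption: networkx is available for the PageRank step

# Toy citation graph: an edge p -> q means article p cites article q.
citations = [("p1", "p3"), ("p2", "p3"), ("p2", "p4"), ("p5", "p3"), ("p5", "p1")]
G = nx.DiGraph(citations)

pagerank = nx.pagerank(G)

# Two ways to fill the article-article matrix that replaces user ratings:
binary_weight = {(p, q): 1.0 for p, q in citations}          # constant 0/1 weighting
ranked_weight = {(p, q): pagerank[q] for p, q in citations}   # citations "rated" by PageRank

print("PageRank scores:", {k: round(v, 3) for k, v in pagerank.items()})
print("Binary weights: ", binary_weight)
print("Ranked weights: ", {k: round(v, 3) for k, v in ranked_weight.items()})
```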
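A sketch of the TechLens+/Sarkanto idea from slides 6-8: treat each citing article as a pseudo-user and its references as items, then do item-item collaborative filtering over the resulting matrix. The data and the plain cosine similarity are illustrative assumptions; Sarkanto itself uses the TASTE (Mahout) recommender rather than this code.

```python
import numpy as np

# Articles and the references each one cites. The citing article plays the role
# of a "user" and its references play the role of "items" (the TechLens+ idea).
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r1", "r5"},
}

items = sorted({r for refs in references.values() for r in refs})
index = {r: i for i, r in enumerate(items)}

# Binary article-reference matrix (rows = citing articles, columns = references).
M = np.zeros((len(references), len(items)))
for row, refs in enumerate(references.values()):
    for r in refs:
        M[row, index[r]] = 1.0

# Item-item cosine similarity between references (co-citation strength).
norms = np.linalg.norm(M, axis=0)
sim = (M.T @ M) / np.outer(norms, norms)

def similar_items(ref, k=3):
    """References most similar to `ref` by co-citation."""
    scores = sim[index[ref]].copy()
    scores[index[ref]] = -1.0  # exclude the seed itself
    top = np.argsort(-scores)[:k]
    return [(items[i], round(float(scores[i]), 3)) for i in top]

print(similar_items("r2"))  # [('r3', 1.0), ('r4', 0.707), ('r1', 0.5)]
```

Swapping citing articles in for users is what lets this style of recommender run where per-user usage data is too sparse.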
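A compressed sketch of the slide 12 pipeline, substituting scikit-learn for the Lucene / Semantic Vectors / R toolchain named on the slide; the journal texts are placeholders and the projection dimension is shrunk from the 512 used in the study to suit the toy vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import GaussianRandomProjection
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

# One concatenated full-text blob per journal (placeholders here; the real
# input is the concatenation of every article in each journal).
journal_texts = {
    "J. Chemistry": "reaction catalyst polymer synthesis yield solvent",
    "J. Physics":   "quantum lattice photon boson spectrum energy",
    "J. Biology":   "protein gene cell enzyme receptor expression",
}
names = list(journal_texts)

# Term-journal matrix (the slides build this from a Lucene index).
tfidf = TfidfVectorizer().fit_transform(journal_texts.values()).toarray()

# Reduced-dimension journal vectors via random projection (512 dims in the
# study; only 2 here because the toy vocabulary is tiny).
proj = GaussianRandomProjection(n_components=2, random_state=0)
vectors = proj.fit_transform(tfidf)

# Pairwise journal-journal distances, then MDS down to 2-D for mapping.
dist = cosine_distances(vectors)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

for name, (x, y) in zip(names, coords):
    print(f"{name:14s} ({x:+.2f}, {y:+.2f})")
```

At full scale, these journal-journal distances serve as a proxy for the semantic distance between a seed article and a recommended article when the recommended article's full text is unavailable.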
