
How Accurate are IR Usage Statistics?

Presentation given by Joseph Greene, Research Repository Librarian at University College Dublin Library, at Open Repositories 2016, held at Trinity College Dublin, June 13-16th, 2016.


  1. Leabharlann UCD, An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Éire
     UCD Library, University College Dublin, Belfield, Dublin 4, Ireland
     Joseph Greene, Research Repository Librarian, University College Dublin
     joseph.greene@ucd.ie | http://researchrepository.ucd.ie
     How accurate are IR usage statistics?
     Open Repositories 2016, Dublin, 16 June
  2. Usage statistics are important for OA repositories
     • How is the service used overall?
     • Advocacy
       – Connects with authors on what is most important to them: the use of their research
     • KPI for return on investment
       – Usage of a Library service
       – Visibility of the university's research
  3. Monthly email sent to all depositors
  4. Infographic distributed semi-annually by College Liaison Librarians
  5. How accurate are they? Web robots
     • Some follow rules
       – Search engines, Internet Archive, link checkers, Twitterbot, etc.
       – Obey robots.txt and name themselves in the user agent string
     • Others do not
       – Email spammers, comment spammers, dictionary attackers, phishers, etc.
       – Often mimic human users
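The "follows the rules" case can be sketched in code. A minimal, hypothetical example: flag requests whose user agent string names a robot. Production systems use maintained lists (e.g. the COUNTER robots list); the patterns below are purely illustrative.

```python
import re

# Illustrative patterns only; real deployments use maintained lists
# such as the COUNTER list of robot/spider user agents.
ROBOT_UA_PATTERNS = [
    r"bot\b", r"crawler", r"spider", r"archive\.org_bot", r"slurp",
]

def is_declared_robot(user_agent: str) -> bool:
    """True if the user agent string names itself as a robot."""
    ua = user_agent.lower()
    return any(re.search(pattern, ua) for pattern in ROBOT_UA_PATTERNS)

print(is_declared_robot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(is_declared_robot("Mozilla/5.0 (Windows NT 10.0) Firefox/47.0"))  # False
```

Robots that "do not follow rules" defeat exactly this kind of check by sending browser-like user agent strings, which is why the behavioural techniques later in the talk are needed.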
  6. Experimental study
     • Simple random sample of 2 years of UCD repository download data
       – n=341, N=3.3 million; 96.20% certainty
     • Each sampled download manually checked to determine whether it was made by a robot or a human
     • Findings compared against our robot detection technique
       – U. Minho DSpace Stats Add-on
       – Monthly outlier exclusion (manual)
     Greene, J. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016.
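As a rough sanity check on a sample of this size, here is a normal-approximation confidence interval for a sampled proportion. This is my own calculation, not from the slides, and the slides' "96.20% certainty" figure may have been derived differently; the 95% level and the 85% robot share are assumptions for illustration.

```python
import math

n, N = 341, 3_300_000   # sample size and population size from the study
p = 0.85                # observed robot share in the sample
z = 1.96                # ~95% confidence level (assumed)

# Finite population correction is negligible here since n << N.
fpc = math.sqrt((N - n) / (N - 1))
margin = z * math.sqrt(p * (1 - p) / n) * fpc
print(f"{p:.0%} ± {margin:.1%}")  # 85% ± 3.8%
```

So a sample of 341 pins down an 85% proportion to within about four percentage points, which is adequate for the headline findings that follow.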
  7. First finding: 85% of Research Repository UCD's unfiltered downloads come from robots
     • Consistent with a 2013 IRUS-UK white paper covering 20 IRs, which also found 85% robot traffic
  8. [Chart] Accuracy of download stats (inverse precision) vs. recall (robots): catching more robots improves stats, but how much depends on the number of robots. Reference points: typical website, 15% robot traffic; OA journal, 40%; OA repositories, 85%; Internet Archive, 91%.
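The relationship the chart illustrates can be written out. Assuming, for simplicity, a detector with perfect precision (no humans ever filtered out), the accuracy of the reported stats depends only on the robot share of the traffic and the detector's recall. This derivation is mine, not from the slides:

```python
def stats_accuracy(robot_share: float, recall: float) -> float:
    """Fraction of the *reported* (post-filter) downloads that are human,
    assuming the detector never mislabels a human (perfect precision)."""
    humans = 1.0 - robot_share
    uncaught_robots = robot_share * (1.0 - recall)
    return humans / (humans + uncaught_robots)

# At 85% robot traffic (OA repositories), 90% recall still leaves
# noticeable robot inflation; a typical website at 15% is far less sensitive.
print(round(stats_accuracy(0.85, 0.90), 3))  # 0.638
print(round(stats_accuracy(0.15, 0.90), 3))  # 0.983
```

As a consistency check, plugging in the UCD figures reported later (85% robots, 0.942 recall) gives roughly 0.75, close to the reported 0.73; the small gap reflects the detector's imperfect precision, which the simplification above ignores.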
  9. How did we do at UCD?
     • What proportion of robot downloads did we catch? (Recall)
       – Our method catches 94% of all robots
     • How often were we correct: how many of the downloads we labelled robots really are robots? (Precision)
       – 98.9% of downloads that we label robots really are robots
     • How accurate are the download stats: how many reported downloads are actually made by human beings? (Inverse precision)
       – 73% of the download statistics as reported are human
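The three measures come from an ordinary binary confusion matrix, with "positive" meaning labelled a robot. A minimal sketch; the specific counts below are my invention, chosen only to be loosely consistent with the UCD figures:

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """tp = robots caught, fp = humans wrongly flagged,
    fn = robots missed (left in the stats), tn = humans kept."""
    recall = tp / (tp + fn)              # robots caught / all robots
    precision = tp / (tp + fp)           # flagged downloads that really are robots
    inverse_precision = tn / (tn + fn)   # reported downloads that really are human
    return recall, precision, inverse_precision

# Illustrative counts for a 341-download sample at ~85% robots.
r, p, ip = metrics(tp=273, fp=3, fn=17, tn=48)
print(round(r, 3), round(p, 3), round(ip, 3))  # 0.941 0.989 0.738
```

Note that inverse precision is computed over what survives the filter, which is why it is the measure that directly answers "how accurate are the published download stats?"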
  10. How does that compare?
      • Who knows? There are no other studies like this on repositories!
      • Applied DSpace's and EPrints' web robot detection algorithms to our data
        – Experimental, using real data
        – Same dataset used for each ‘system’
        – Algorithms easy to mimic in vitro
        – But SEO and crawl behaviour may differ between systems
  11. Robot detection techniques used (✓ marks across DSpace, EPrints, and the Minho DSpace Statistics Add-on)
      • Rate of requests: ✓³
      • User agent string: ✓ ✓ ✓
      • robots.txt access: ✓
      • Volume of requests: ✓² ✓³
      • List of known robot IP addresses: ✓ ✓
      • Reverse DNS name lookup: ✓¹
      • Trap file: ✓
      • User agents per IP address: (none)
      • Width of traversal in the URL space: ✓³
      ¹ Only implemented nominally or experimentally
      ² Via the repeat download or ‘double-click’ filter
      ³ Data available as a configurable report for manual decision making
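Two of the tabled heuristics, a known-robot IP list and a rate-of-requests threshold, can be sketched together. Everything here (the IP, the threshold, the function name) is illustrative, not any system's actual implementation:

```python
from collections import defaultdict
from datetime import datetime, timedelta

KNOWN_ROBOT_IPS = {"66.249.64.1"}   # e.g. drawn from a maintained robot IP list
MAX_REQUESTS_PER_MINUTE = 30        # illustrative rate threshold

def classify(log: list[tuple[str, datetime]]) -> dict[str, str]:
    """Label each IP 'robot' or 'human' from (ip, timestamp) log entries."""
    by_ip = defaultdict(list)
    for ip, ts in log:
        by_ip[ip].append(ts)
    labels = {}
    window = timedelta(minutes=1)
    for ip, times in by_ip.items():
        times.sort()
        # busiest one-minute window for this IP
        busiest = max(
            sum(1 for t in times if start <= t < start + window)
            for start in times
        )
        if ip in KNOWN_ROBOT_IPS or busiest > MAX_REQUESTS_PER_MINUTE:
            labels[ip] = "robot"
        else:
            labels[ip] = "human"
    return labels
```

For example, an IP issuing 40 requests in under a minute would be labelled a robot, while one downloading a few files spread over an afternoon would pass as human; sneaky robots evade exactly this by throttling themselves, which is why no single technique in the table suffices alone.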
  12. Results
  13. Robots detected (recall): DSpace 0.897; EPrints 0.911; Minho (no manual outlier checking) 0.890; Minho plus monthly manual checking (UCD) 0.942
  14. Accuracy of detection (precision): DSpace 1.000; EPrints 0.940; Minho (no manual outlier checking) 0.989; Minho plus monthly manual checking (UCD) 0.989
  15. Accuracy of download stats (inverse precision): DSpace 0.620; EPrints 0.552; Minho (no manual outlier checking) 0.590; Minho plus monthly manual checking (UCD) 0.730; without filtration 0.144. I.e. 38% of DSpace's reported downloads are made by robots, etc.
  16. [Chart] Robot detection in OA IR systems: recall, precision and negative precision (accuracy of download stats) compared for DSpace, EPrints, Minho, Minho with monthly manual checking (UCD), and no robot detection.
  17. Thank you!
