
#iCanHazRobot?: improved robot detection for IR usage statistics


Presentation given by Joseph Greene, Research Repository Librarian at University College Dublin Library, at Open Repositories held at Trinity College Dublin, June 13-16th, 2016.


  1. Leabharlann UCD, An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
     UCD Library, University College Dublin, Belfield, Dublin 4, Ireland
     Joseph Greene, Research Repository Librarian, University College Dublin
     joseph.greene@ucd.ie
     http://researchrepository.ucd.ie
     #iCanHazRobot? Improved robot detection for IR usage statistics
     Open Repositories 2016, Dublin, 14 June
  2. Overview and take-home points
     • Usage stats are important
       – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
     • Robot filtration is a problem, especially in repositories
     • Robot detection has an exponential effect on usage stats’ accuracy in repositories
     • 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
  3. Experimental study
     • Simple random sample of 2 years of UCD repository’s download data
       – n=341, N=3.3 million; 96.20% certainty
     • Manually checked to determine if robot or human
     • Applied DSpace, EPrints robot detection algorithms to the dataset
       – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
       – The data is real, live data, and the algorithms were very easy to simulate
  4. First finding
     85% of unfiltered repository downloads come from robots
     • This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found that 85% of downloads came from robots
  5. Catching more robots improves stats (but how much depends on the number of robots)
     [Figure: line chart. x-axis: Recall (robots), 0 to 1 ("Catch more robots"). y-axis: Accuracy of download stats (inverse precision), 0 to 1 ("Get better stats"). Curves: Typical website, 15% robot traffic; OA journal, 40% robot; OA repositories, 85% robot; Internet Archive, 91% robot.]
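
The shape of the curves on this slide can be approximated with a short calculation. The sketch below is not taken from the presentation: it assumes that no human downloads are wrongly excluded (perfect precision), so the reported stats consist of all human downloads plus whichever robot downloads were missed. The function name `stats_accuracy` is hypothetical.

```python
def stats_accuracy(robot_share: float, recall: float) -> float:
    """Share of reported downloads that are human.

    Assumption (not from the slides): no human download is wrongly
    excluded, so reported = humans + robots that were not caught.
    """
    if not 0.0 <= robot_share < 1.0:
        raise ValueError("robot_share must be in [0, 1)")
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be in [0, 1]")
    humans = 1.0 - robot_share
    missed_robots = robot_share * (1.0 - recall)
    return humans / (humans + missed_robots)


if __name__ == "__main__":
    # Robot shares quoted on the slide; recall values are illustrative.
    for share in (0.15, 0.40, 0.85, 0.91):
        row = [round(stats_accuracy(share, r), 2) for r in (0.0, 0.5, 0.9, 1.0)]
        print(f"robot share {share:.2f}: {row}")
```

Under this assumption, a repository with 85% robot traffic that catches 90% of robots would still report stats that are only about 64% human (0.15 / 0.235), which is why the last few points of recall matter so much when robot traffic dominates.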
  6. Robot detection techniques used
     Systems compared: DSpace; EPrints; Minho DSpace Statistics Add-on
     • Rate of requests ✓ [3]
     • User agent string ✓ ✓ ✓
     • robots.txt access ✓
     • Volume of requests ✓ [2] ✓ [3]
     • List of known robot IP addresses ✓ ✓
     • Reverse DNS name lookup ✓ [1]
     • Trap file ✓
     • User agents per IP address
     • Width of traversal in the URL space ✓ [3]
     [1] Only implemented nominally or experimentally
     [2] Via the repeat download or ‘double-click’ filter
     [3] Data available as a configurable report for manual decision making
  7. Measurements used in robot detection
     • All measurements are a number between 0 and 1
     • Recall: proportion of robots detected
       – I can haz robot?
     • Precision: true positives in robot detection
       – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
     • Accuracy of download stats measured as inverse precision:
       – Proportion of stats that are actually made by humans
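
These three measurements can be written down in terms of a confusion matrix where "positive" means "flagged as a robot". The sketch below is a minimal illustration; the class and function names are hypothetical, and the example counts are invented for demonstration, not results from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # robots correctly flagged as robots
    fp: int  # humans wrongly flagged as robots
    tn: int  # humans correctly left in the stats
    fn: int  # robots missed and left in the stats


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 rather than failing when a category is empty.
    return numerator / denominator if denominator else 0.0


def recall(c: Counts) -> float:
    """Proportion of all robots that were detected."""
    return _ratio(c.tp, c.tp + c.fn)


def precision(c: Counts) -> float:
    """Proportion of discounted downloads that really were robots."""
    return _ratio(c.tp, c.tp + c.fp)


def inverse_precision(c: Counts) -> float:
    """Proportion of reported downloads that really were human."""
    return _ratio(c.tn, c.tn + c.fn)


if __name__ == "__main__":
    example = Counts(tp=80, fp=5, tn=10, fn=5)  # invented counts
    print(recall(example), precision(example), inverse_precision(example))
```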
  8. How they perform, out-of-the-box
     [Figure: bar chart, "Robot detection in OA IR systems". y-axis: 0 to 1. Groups: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection. Series: Recall; Precision; Negative precision (accuracy of download stats).]
  9. Room for improvement?
  10. 1. Ability to manually check for outliers
     • At UCD, once a month, we check:
       – Daily downloads for the last 2-4 months
       – Top 10 most downloaded items
       – Top 20 downloading IP addresses for the last 2-4 months
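
The monthly checks listed on this slide amount to a few frequency counts over the download events. The sketch below shows one way such reports might be produced; the event layout (dicts with `date`, `item` and `ip` keys) and the function name `top_n` are assumptions for illustration and do not describe how UCD or any repository platform stores its data.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_n(events: Iterable[Mapping[str, str]], key: str, n: int):
    """Return the n most frequent values of `key` with their counts."""
    return Counter(event[key] for event in events).most_common(n)


if __name__ == "__main__":
    # Invented sample events; a real report would read months of data.
    events = [
        {"date": "2016-05-01", "item": "item-1", "ip": "192.0.2.1"},
        {"date": "2016-05-01", "item": "item-1", "ip": "192.0.2.1"},
        {"date": "2016-05-02", "item": "item-2", "ip": "192.0.2.1"},
        {"date": "2016-05-02", "item": "item-1", "ip": "198.51.100.7"},
    ]
    print(top_n(events, "date", 120))  # daily downloads
    print(top_n(events, "item", 10))   # top 10 most downloaded items
    print(top_n(events, "ip", 20))     # top 20 downloading IP addresses
```

A person still has to look at the output and decide which spikes, items or addresses are outliers; the code only surfaces the candidates.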
  11. [Figure: two bar charts, y-axis 0 to 1. Left: Robots caught (Recall) for DSpace, EPrints, Minho. Right: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion).]
  12. 2. Recalibrate the EPrints repeat-download (double-click) filter
     [Figure: bar chart, "Effect of double-click filter on EPrints’ robot detection and stats". y-axis: 0 to 1. Measurements: Recall (robots); Precision (accuracy of excluded downloads); Inverse recall (legitimate downloads accounted for in stats); Inverse precision (accuracy of reported download stats); Overall accuracy, $(T_p + T_n)/n$. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*.]
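
A repeat-download filter of this kind can be sketched in a few lines. The code below is an illustration only, not EPrints' implementation: it assumes a download is dropped when the same IP address requested the same item within `window_seconds` of its previous request, and the window values used in the example are arbitrary. "Recalibrating" then corresponds to choosing a different `window_seconds`.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp in seconds, ip, item id)


def filter_repeat_downloads(events: Iterable[Event], window_seconds: int) -> List[Event]:
    """Keep a download only if the same IP did not request the same
    item within the preceding `window_seconds`."""
    last_seen = {}
    kept = []
    for ts, ip, item in sorted(events):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((ts, ip, item))
        # Design choice: the window slides with every request, so a
        # client hammering one file keeps being filtered.
        last_seen[key] = ts
    return kept


if __name__ == "__main__":
    sample = [(0, "192.0.2.1", "item-1"), (10, "192.0.2.1", "item-1"),
              (100, "192.0.2.1", "item-1"), (5, "198.51.100.7", "item-1")]
    print(filter_repeat_downloads(sample, window_seconds=30))
```

A longer window excludes more robot traffic but also more legitimate repeat visits, which is the trade-off between recall and inverse recall that the chart measures.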
  13. 3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints
     • 1 Java class
     • Input is Apache Combined Log Format
     • Output is a database update (robot = true field)
       – Similar to EPrints' $is_robot variable in Robots.pm
       – Could be modified to update the DSpace 'isBot' field in the SOLR usage events document
     • Requires 2 database tables to store learned agents and IPs
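
To make the input and output of such a log parser concrete, the sketch below reads one line in Apache Combined Log Format and attaches a `robot` flag. It is written in Python for brevity and is not Minho's code: the user-agent hints in `ROBOT_HINTS` are a hypothetical stand-in for Minho's learned agent and IP tables, and the sample log lines are invented.

```python
import re
from typing import Optional

# Apache Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical hints; a real detector would use learned agents and IPs.
ROBOT_HINTS = ("bot", "crawler", "spider")


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line; return None if it is not in combined format."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    agent = record["agent"].lower()
    # Stand-in for the "robot = true" database update on the slide.
    record["robot"] = any(hint in agent for hint in ROBOT_HINTS)
    return record


if __name__ == "__main__":
    line = ('192.0.2.1 - - [14/Jun/2016:11:00:00 +0100] '
            '"GET /example.pdf HTTP/1.1" 200 12345 "-" '
            '"Mozilla/5.0 (compatible; Googlebot/2.1)"')
    print(parse_line(line))
```

In a port to DSpace or EPrints, the `robot` value would be written to the platform's own flag rather than returned in a dict.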
  14. [Figure: two bar charts, y-axis 0 to 1. Left: Robots caught (Recall) for DSpace, EPrints, Minho. Right: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With Minho log parser.]
  15. 4. Combine two or more techniques
     [Figure: bar chart, Robots caught (Recall), y-axis 0 to 1, for DSpace, EPrints, Minho. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
  16. 4. Combine two or more techniques
     [Figure: bar chart, Accuracy of reported download stats (Inverse precision), y-axis 0 to 1, for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
  17. Thank you!
