Robot Hunter: or precisely what I thought I wouldn't be doing when I became a librarian

Presentation given by Joseph Greene, Research Repository Librarian at University College Dublin Library, to the Inaugural CONUL Conference, June 4th, 2015 in Athlone, Ireland.

Published in: Technology

  1. Leabharlann UCD, An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
     UCD Library, University College Dublin, Belfield, Dublin 4, Ireland
     Robot hunter: or, precisely what I thought I wouldn’t be doing when I became a librarian
     Joseph Greene, Research Repository Librarian
     joseph.greene@ucd.ie
     http://researchrepository.ucd.ie
  2. Counting downloads
     • Open Access repositories make science and scholarship accessible, and we need to demonstrate our value
     • Simple question: how often are these papers used? How many times have they been downloaded?
  3. Enter the Robot
     • At least 18% of web requests are from robots
     • Less than half can be accounted for by the five main search engines
     • At Research Repository UCD, two-thirds of our repository’s downloads are marked as web robots
  4. What are you talking about?
     Internet robot, Web robot, automated agent, crawler, spider, bot: any programme that visits websites and systematically retrieves information from them
  5. Good and bad
     • Search engines, link verifiers, computer science experiments
     • Gathering content for spam, phishing and copycat sites, artificially improving a website’s ranking (spamdexing), looking for security holes, DDoS attacks…
  6. ‘And the noisy, nasty nuisance grew, ’til the villagers cried, “What can we do?”’
     Detection methods:
     • Blocking robots in real time: Turing tests
     • Detecting later and removing from statistics
  7. Appropriate, but problematic methods for repositories
     • Excluding known robots by user-agent name
       – Easily faked or omitted
     • Excluding by IP address
       – DHCP, and the list is growing exponentially
     • Usage pattern analysis: query rate and resources requested
       – Expensive to automate
     • Machine learning: training decision trees, neural nets and/or statistical systems
       – Did you say expensive???
     • Combined approaches
  8. Effectiveness, and out-of-the-box repository strategies
     Strength of each detection method:

     Robots detected by            Recall (%)   Precision (%)
     No images requested              98.34         75.48
     No referring site                96.27         52.25
     List of IP addresses             69.29         99.40
     HEAD method to access site       32.37        100.00
     Agent name declared              26.56        100.00
     Access only at night             24.48         50.43
     Robots.txt file accessed         17.01        100.00
     Time, σ (3s)                      2.49        100.00
     Time, average (1s)                2.49         75.00

     DSpace uses IP addresses of known agents, which is much weaker than in the benchmarking study
  9. Effectiveness, and out-of-the-box repository strategies (table as on slide 8)
     Eprints filters based on the number of hits from an IP address per day, similar to the time-based strategies in the benchmarking study
  10. Effectiveness, and out-of-the-box repository strategies (table as on slide 8)
  11. Centralised strategy: IRUS-UK
      • Collects and filters statistics from 84 DSpace and Eprints repositories
      • COUNTER-compliant usage statistics
      • Robot exclusion:
        – The COUNTER list of agent names
        – All downloads from IP addresses where there are more than 200 downloads in a day from a repository
        – Most downloads from IP addresses where there are more than 100 downloads in a day from a repository
      • Work commissioned to investigate feasibility and approach to adaptive filtering based on usage behaviour
  12. Sources by slide
      1: Bill Gosper's Glider Gun in action, a variation of Conway's Game of Life. Johan G. Bontes. <https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#/media/File:Gospers_glider_gun.gif>
      3, 6, 7: Doran, D.; Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery (2011) 22:183-210. DOI:10.1007/s10618-010-0180-z
      4: <http://pixabay.com/static/uploads/photo/2015/05/31/12/09/wooden-791421_640.jpg>
      5: Bad Robot Productions logo. 2001-2008. <https://en.wikipedia.org/wiki/Bad_Robot_Productions#/media/File:Bad_Robot_Productions_logo.jpg>
      6: Burroway, J.; Lord, J. V. The Giant Jam Sandwich. 1972, Houghton Mifflin Harcourt.
      8, 9, 10: Geens, N.; Huysmans, J.; Vanthienen, J. Evaluation of Web Robot Discovery Techniques: A Benchmarking Study. Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. Lecture Notes in Computer Science 4065, pp 121-130, 2006. DOI:10.1007/11790853_10
      8: Diggory, Mark. SOLR Statistics. DSpace Wiki. <https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics>
      9: Joint, Nicholas. [EP-tech] Re: Please change the way IRstats works. Eprints_tech mailing list, 2011-10-13. <http://www.eprints.org/tech.php/15695.html>
      11: IRUS-UK. <http://www.irus.mimas.ac.uk/participants/>
  13. Thank you!
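
Each row of the table on slide 8 is a simple rule applied to a visitor session and scored by recall (the share of real robots the rule catches) and precision (the share of flagged sessions that really are robots). The following is a minimal Python sketch of that scoring, not the benchmarking study's code. The `Session` fields, the `RULES` dictionary and the toy `SESSIONS` data are all hypothetical and were invented for illustration; only four of the table's rules are shown.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One visitor session from a web log (hypothetical structure)."""
    requested_images: bool
    has_referrer: bool
    used_head: bool
    fetched_robots_txt: bool
    is_robot: bool  # ground-truth label, only available in a labelled benchmark set


# Each rule returns True when it flags the session as a robot.
RULES = {
    "No images requested": lambda s: not s.requested_images,
    "No referring site": lambda s: not s.has_referrer,
    "HEAD method to access site": lambda s: s.used_head,
    "Robots.txt file accessed": lambda s: s.fetched_robots_txt,
}


def recall_precision(rule, sessions):
    """Return (recall %, precision %) of one rule over labelled sessions."""
    flagged = [s for s in sessions if rule(s)]
    true_pos = sum(1 for s in flagged if s.is_robot)
    robots = sum(1 for s in sessions if s.is_robot)
    recall = 100.0 * true_pos / robots if robots else 0.0
    precision = 100.0 * true_pos / len(flagged) if flagged else 0.0
    return recall, precision


# Made-up toy data, NOT the data set used in the benchmarking study.
SESSIONS = [
    Session(False, False, True, True, True),      # robot
    Session(False, False, False, False, True),    # robot
    Session(False, True, False, False, True),     # robot
    Session(True, True, False, False, False),     # human
    Session(False, False, False, False, False),   # human, text-only browser
    Session(True, False, False, False, False),    # human, arrived via bookmark
]

if __name__ == "__main__":
    for name, rule in RULES.items():
        recall, precision = recall_precision(rule, SESSIONS)
        print(f"{name:28s} recall={recall:6.2f}  precision={precision:6.2f}")
```

Even on toy data the trade-off in the table shows up: a broad rule such as "No images requested" catches every robot but also flags a human, while a narrow rule such as "HEAD method" never flags a human but misses most robots.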

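The per-IP daily thresholds described for IRUS-UK on slide 11 (and, in a similar spirit, the Eprints hits-per-day filter on slide 9) can be sketched as below. This is an illustration under stated assumptions, not IRUS-UK's implementation: the record layout `(repository, ip, date)`, the function name `filter_downloads` and the constant names are hypothetical. The slide says "most" downloads are removed above 100 per day without saying which, so the sketch only sets that tier aside as flagged rather than guessing a rule.

```python
from collections import Counter

EXCLUDE_ALL_ABOVE = 200  # slide 11: all downloads excluded above this daily count
REVIEW_ABOVE = 100       # slide 11: "most" downloads excluded above this daily count


def filter_downloads(downloads):
    """Split download records by their per-repository, per-IP, per-day count.

    `downloads` is an iterable of (repository, ip, date) tuples, one per download
    (an assumed layout). Returns three lists: (kept, flagged, excluded).
    """
    downloads = list(downloads)
    # Identical tuples mean the same repository, IP address and day.
    counts = Counter(downloads)
    kept, flagged, excluded = [], [], []
    for record in downloads:
        n = counts[record]
        if n > EXCLUDE_ALL_ABOVE:
            excluded.append(record)   # treated as robot traffic
        elif n > REVIEW_ABOVE:
            flagged.append(record)    # rule for this tier is not specified on the slide
        else:
            kept.append(record)
    return kept, flagged, excluded


if __name__ == "__main__":
    # Made-up log using documentation-reserved IP addresses.
    log = (
        [("repo-a", "192.0.2.1", "2015-06-04")] * 250
        + [("repo-a", "192.0.2.2", "2015-06-04")] * 150
        + [("repo-a", "192.0.2.3", "2015-06-04")] * 3
    )
    kept, flagged, excluded = filter_downloads(log)
    print(len(kept), len(flagged), len(excluded))  # 3 150 250
```

A fixed threshold like this is cheap to run, but as slide 7 notes for IP-based methods generally, addresses change hands, so a busy shared address can be excluded and a slow robot can pass.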