Presented at the CONUL Conference, July 2015, Athlone, Ireland by Joseph Greene, University College Dublin.
Abstract
Web robots have become an enormous problem and must be considered when collecting and analysing web usage statistics. This is particularly problematic for institutional repositories, as the major platforms (DSpace and EPrints) have only basic robot detection and filtration capabilities for their native statistics packages. These systems, as well as a DSpace extension, the University of Minho StatisticsAddOn, will be described along with general information about web robots such as definition, usefulness and behaviour. Work is currently being done by COUNTER and IRUS-UK in this area. Their work will be discussed in the context of its applicability in Ireland.
Biography
Joseph received an MLIS in 2005 from Louisiana State University in Baton Rouge, Louisiana. He worked in East Baton Rouge Parish Libraries for three years before moving to Ireland and working with the Irish Virtual Research Library and Archive (IVRLA) at University College Dublin. Joseph became UCD’s systems librarian in 2008. He completed a diploma in project management in 2009 and has been responsible for the UCD institutional repository since 2008.
Robot Hunter, or, precisely what I thought I wouldn't be doing when I became a librarian - Joseph Greene
1. Leabharlann UCD
An Coláiste Ollscoile, Baile
Átha Cliath,
Belfield, Baile Átha Cliath 4,
Eire
UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Robot hunter
Or, precisely what I thought I wouldn’t
be doing when I became a librarian
Joseph Greene
Research Repository Librarian
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
2. Counting downloads
• Open Access repositories make science and
scholarship accessible, and we need to
demonstrate our value
• Simple question: how often are these papers
used? How many times have they been
downloaded?
3. Enter the Robot
• At least 18% of web requests are from robots
• Less than half can be accounted for by the five
main search engines
• At Research Repository UCD, two-thirds of our
repository’s downloads are flagged as coming from web robots
4. What are you talking about?
Internet robot, Web robot, automated agent,
crawler, spider, bot: any programme that visits
websites and systematically retrieves information
from them
5. Good and bad
• Search engines, link verifiers, computer science
experiments
• Gathering content for spam, phishing and copycat
sites, artificially improving a website’s ranking
(spamdexing), looking for security holes, DDoS
attacks…
6. ‘And the noisy, nasty nuisance grew, ‘til
the villagers cried, “What can we do?”’
Detection methods:
• Blocking robots in real-time:
Turing tests
• Detecting later and removing
from statistics
7. Appropriate, but problematic methods
for repositories
• Excluding known robots by user-agent name
– Easily faked or omitted
• Excluding by IP address
– Addresses are reassigned dynamically (DHCP), and the list is growing exponentially
• Usage pattern analysis: query rate and resources
requested
– Expensive to automate
• Machine learning: training decision trees, neural
nets and/or statistical systems
– Did you say expensive???
• Combined approaches
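The first two methods on this slide, exclusion by user-agent name and by IP address, can be sketched in a few lines of Python. This is a minimal illustration rather than the filter used by any repository platform: the agent patterns, the IP addresses and the record field names are all hypothetical, and treating a missing user-agent as a robot is an assumption made here.

```python
import re

# Hypothetical lists, for illustration only. A real deployment would load
# a maintained list of agent names and addresses instead.
ROBOT_AGENT_PATTERNS = [r"bot", r"crawler", r"spider", r"slurp"]
KNOWN_ROBOT_IPS = {"192.0.2.10", "198.51.100.7"}

_AGENT_RE = re.compile("|".join(ROBOT_AGENT_PATTERNS), re.IGNORECASE)


def is_probable_robot(user_agent, ip):
    """Return True if the request matches the agent list or the IP list."""
    if not user_agent:
        # Assumption: an omitted user-agent is treated as suspicious.
        return True
    if _AGENT_RE.search(user_agent):
        return True
    return ip in KNOWN_ROBOT_IPS


def filter_downloads(records):
    """Keep only the download records that match neither list."""
    return [
        r for r in records
        if not is_probable_robot(r.get("user_agent"), r["ip"])
    ]
```

A robot that sends a browser-like user-agent from an unlisted address passes straight through this filter, which is the weakness the slide points out.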
8. Effectiveness, and out-of-the-box repository strategies
                                      Strength
Robots detected by             Recall (%)   Precision (%)
No images requested               98.34         75.48
No referring site                 96.27         52.25
List of IP addresses              69.29         99.40
HEAD method to access site        32.37        100.00
Agent name declared               26.56        100.00
Access only at night              24.48         50.43
Robots.txt file accessed          17.01        100.00
Time, σ (3s)                       2.49        100.00
Time, average (1s)                 2.49         75.00
DSpace uses IP addresses of
known agents – much weaker than
in the benchmarking study
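To read the table, it helps to recall how recall and precision are usually defined: recall is the share of all real robots that a method catches, and precision is the share of everything it flags that really is a robot. The sketch below assumes the benchmarking study uses these standard definitions. The function name and the session identifiers are hypothetical.

```python
def recall_precision(flagged, actual_robots):
    """Compute (recall, precision) for one detection method.

    flagged:       session ids the method marked as robots
    actual_robots: session ids that are known to be robots
    """
    flagged, actual = set(flagged), set(actual_robots)
    true_positives = len(flagged & actual)
    recall = true_positives / len(actual) if actual else 0.0
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Made-up example: four real robot sessions; the method flags three
# sessions, two of which are real robots.
recall, precision = recall_precision(
    flagged={"s1", "s2", "s5"},
    actual_robots={"s1", "s2", "s3", "s4"},
)
```

Here recall is 2/4 and precision is 2/3. A row such as "Agent name declared" shows the opposite trade-off: everything it flags is a robot, but it misses most of them.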
9. Effectiveness, and out-of-the-box repository strategies
EPrints filters based on the number of
hits from an IP address per day,
similar to the time-based strategies in
the benchmarking study
11. Centralised strategy: IRUS-UK
• Collects and filters statistics from 84 DSpace and
EPrints repositories
• COUNTER-compliant usage statistics
• Robot exclusion:
– The COUNTER list of agent names
– All downloads from IP addresses where there are
more than 200 downloads in a day from a
repository
– Most downloads from IP addresses where there are
more than 100 downloads in a day from a
repository
• Work commissioned to investigate feasibility and
approach to adaptive filtering based on usage
behaviour
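The 200-downloads-a-day rule above can be sketched as a simple counting filter. This is an illustration under stated assumptions, not IRUS-UK's implementation: the record fields (`repository`, `ip`, `date`) and the function name are hypothetical. The second tier ("most downloads" above 100 a day) is left out because the slide does not say which downloads are kept.

```python
from collections import Counter

DAILY_LIMIT = 200  # threshold stated on the slide


def exclude_heavy_ips(downloads, daily_limit=DAILY_LIMIT):
    """Drop every download from an IP address that made more than
    `daily_limit` downloads from one repository on one day."""
    downloads = list(downloads)

    def key(d):
        return (d["repository"], d["ip"], d["date"])

    # Count downloads per (repository, IP address, day).
    per_day = Counter(key(d) for d in downloads)
    return [d for d in downloads if per_day[key(d)] <= daily_limit]
```

An address with exactly 200 downloads in a day is kept, since the rule applies only to more than 200.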
12. Sources by slide
1 Bill Gosper's Glider Gun in action—a variation of Conway's Game of Life. Johan G.
Bontes.
<https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#/media/File:Gospers_glider_gun.gif>
3, 6, 7 Doran, D.; Gokhale, S.S. Web robot detection techniques: overview and
limitations. Data Mining and Knowledge Discovery (2011) 22:183-210.
DOI:10.1007/s10618-010-0180-z
4 http://pixabay.com/static/uploads/photo/2015/05/31/12/09/wooden-791421_640.jpg
5 Bad Robot Productions logo. 2001-2008.
<https://en.wikipedia.org/wiki/Bad_Robot_Productions#/media/File:Bad_Robot_Productions_logo.jpg>
6 Burroway, J., Lord, J. V. The Giant Jam Sandwich. 1972, Houghton Mifflin Harcourt.
8, 9, 10 Nick Geens, Johan Huysmans, Jan Vanthienen. Evaluation of Web Robot
Discovery Techniques: A Benchmarking Study. Advances in Data Mining.
Applications in Medicine, Web Mining, Marketing, Image and Signal Mining.
Lecture Notes in Computer Science 4065, pp 121-130, 2006.
DOI:10.1007/11790853_10
8 Diggory, Mark. SOLR Statistics. DSpace Wiki.
<https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics>
9 Joint, Nicholas. [EP-tech] Re: Please change the way IRstats works. Eprints_tech
mailing list 2011-10-13 <http://www.eprints.org/tech.php/15695.html>
11 IRUS-UK. <http://www.irus.mimas.ac.uk/participants/>