Leabharlann UCD
An Coláiste Ollscoile, Baile
Átha Cliath,
Belfield, Baile Átha Cliath 4,
Eire
UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Robot hunter
Or, precisely what I thought I wouldn’t
be doing when I became a librarian
Joseph Greene
Research Repository Librarian
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
Counting downloads
• Open Access repositories make science and
scholarship accessible, and we need to
demonstrate our value
• Simple question: how often are these papers
used? How many times have they been
downloaded?
Enter the Robot
• At least 18% of web requests are from robots
• Fewer than half of these can be accounted for by
the five main search engines
• At Research Repository UCD, two-thirds of our
repository’s downloads are flagged as coming from
web robots
What are you talking about?
Internet robot, Web robot, automated agent,
crawler, spider, bot: any programme that visits
websites and systematically retrieves information
from them
Good and bad
• Search engines, link verifiers, computer science
experiments
• Gathering content for spam, phishing and copycat
sites, artificially improving a website’s ranking
(spamdexing), looking for security holes, DDoS
attacks…
‘And the noisy, nasty nuisance grew, ‘til
the villagers cried, “What can we do?”’
Detection methods:
• Blocking robots in real time: Turing tests
• Detecting later and removing from statistics
Appropriate, but problematic methods
for repositories
• Excluding known robots by user-agent name
– Easily faked or omitted
• Excluding by IP address
– Addresses are reassigned under DHCP, and the list
of robot addresses is growing exponentially
• Usage pattern analysis: query rate and resources
requested
– Expensive to automate
• Machine learning: training decision trees, neural
nets and/or statistical systems
– Did you say expensive???
• Combined approaches
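The first two methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the agent-name substrings, the network range and the function name `is_known_robot` are hypothetical placeholders, not the lists any repository platform actually ships, and treating a missing agent name as a robot is a policy choice made for the example.

```python
import ipaddress

# Hypothetical sample lists; a real deployment would load maintained lists.
KNOWN_ROBOT_AGENT_SUBSTRINGS = ("bot", "crawler", "spider", "slurp")
# 192.0.2.0/24 is a documentation-only range (RFC 5737), used as a stand-in.
KNOWN_ROBOT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_known_robot(user_agent, ip):
    """Return True if a request matches either exclusion list."""
    agent = (user_agent or "").lower()
    # Assumption: an omitted agent name is counted as a robot.
    if not agent:
        return True
    # Exclusion by user-agent name (easily faked, so this only catches honest robots).
    if any(token in agent for token in KNOWN_ROBOT_AGENT_SUBSTRINGS):
        return True
    # Exclusion by IP address (the list goes stale as addresses are reassigned).
    address = ipaddress.ip_address(ip)
    return any(address in network for network in KNOWN_ROBOT_NETWORKS)
```

A robot that sends a browser-like agent name from an unlisted address passes both checks, which is the weakness noted in the bullets above.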
Effectiveness, and out-of-the-box repository strategies
Strength (rows ordered strongest first)
Robots detected by            Recall (%)  Precision (%)
No images requested                98.34          75.48
No referring site                  96.27          52.25
List of IP addresses               69.29          99.40
HEAD method to access site         32.37         100.00
Agent name declared                26.56         100.00
Access only at night               24.48          50.43
Robots.txt file accessed           17.01         100.00
Time, σ (3s)                        2.49         100.00
Time, average (1s)                  2.49          75.00
DSpace uses IP addresses of
known agents – much weaker than
in the benchmarking study
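For readers less familiar with the two measures in the table: recall is the share of real robots that a method catches, and precision is the share of flagged sessions that really are robots. A minimal sketch of the arithmetic follows, assuming sessions are identified by simple IDs; the function name and the sample IDs are made up for illustration and are not data from the benchmarking study.

```python
def precision_recall(flagged, actual_robots):
    """Return (precision, recall) as percentages.

    flagged: session IDs a detection method marked as robots.
    actual_robots: session IDs known to be robots (the ground truth).
    """
    flagged, actual_robots = set(flagged), set(actual_robots)
    true_positives = len(flagged & actual_robots)
    precision = 100.0 * true_positives / len(flagged) if flagged else 0.0
    recall = 100.0 * true_positives / len(actual_robots) if actual_robots else 0.0
    return precision, recall


# Made-up example: four real robot sessions, three sessions flagged,
# two of the flagged sessions are real robots.
precision, recall = precision_recall(
    flagged={"s1", "s2", "s5"},
    actual_robots={"s1", "s2", "s3", "s4"},
)
```

Here recall is 50% (two of four robots caught) and precision is about 67% (two of three flags correct). A method with 100% precision but low recall, such as "Agent name declared", never miscounts a human as a robot but misses most robots.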
Eprints filters based on the number of
hits from an IP address per day –
similar to the time-based strategies in
the benchmarking study
Centralised strategy: IRUS-UK
• Collects and filters statistics from 84 DSpace and
Eprints repositories
• COUNTER compliant usage statistics
• Robot exclusion:
– The COUNTER list of agent names
– All downloads from IP addresses where there are
more than 200 downloads in a day from a
repository
– Most downloads from IP addresses where there are
more than 100 downloads in a day from a
repository
• Work commissioned to investigate feasibility and
approach to adaptive filtering based on usage
behaviour
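The two daily thresholds above could be applied roughly as follows. This is a sketch, not IRUS-UK's implementation: the slide does not say how "most" downloads are chosen in the 100 to 200 band, so that band is only labelled for review here, and the function name, the tuple layout and the label strings are assumptions.

```python
from collections import Counter

REMOVE_ALL_ABOVE = 200  # from the slide: all downloads are excluded
REVIEW_ABOVE = 100      # from the slide: "most" are excluded; exact rule not stated


def classify_daily_counts(downloads):
    """Label each (repository, ip, date) group by its daily download count.

    downloads: iterable of (repository, ip, date) tuples, one per download.
    Returns a dict mapping each tuple to 'remove_all', 'review' or 'keep'.
    """
    counts = Counter(downloads)
    labels = {}
    for key, count in counts.items():
        if count > REMOVE_ALL_ABOVE:
            labels[key] = "remove_all"
        elif count > REVIEW_ABOVE:
            labels[key] = "review"
        else:
            labels[key] = "keep"
    return labels
```

Counting per repository, per address and per day means a heavy but legitimate user behind a shared address can still be excluded, which is one reason adaptive filtering based on usage behaviour is being investigated.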
Sources by slide
1 Bill Gosper's Glider Gun in action—a variation of Conway's Game of Life. Johan G.
Bontes.
<https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#/media/File:Gospers_glider_gun.gif>
3, 6, 7 Doran, D.; Gokhale, S.S. Web robot detection techniques: overview and
limitations. Data Mining and Knowledge Discovery (2011) 22:183-210.
DOI:10.1007/s10618-010-0180-z
4 http://pixabay.com/static/uploads/photo/2015/05/31/12/09/wooden-791421_640.jpg
5 Bad Robot Productions logo. 2001-2008.
<https://en.wikipedia.org/wiki/Bad_Robot_Productions#/media/File:Bad_Robot_Productions_logo.jpg>
6 Burroway, J., Lord, J. V. The Giant Jam Sandwich. 1972, Houghton Mifflin Harcourt.
8, 9, 10 Nick Geens, Johan Huysmans, Jan Vanthienen. Evaluation of Web Robot
Discovery Techniques: A Benchmarking Study. Advances in Data Mining.
Applications in Medicine, Web Mining, Marketing, Image and Signal Mining.
Lecture Notes in Computer Science 4065, pp 121-130, 2006.
DOI:10.1007/11790853_10
8 Diggory, Mark. SOLR Statistics. DSpace Wiki.
<https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics>
9 Joint, Nicholas. [EP-tech] Re: Please change the way IRstats works. Eprints_tech
mailing list 2011-10-13 <http://www.eprints.org/tech.php/15695.html>
11 IRUS-UK. <http://www.irus.mimas.ac.uk/participants/>
Thank you!

  • 2. Counting downloads • Open Access repositories make science and scholarship accessible, and we need to demonstrate our value • Simple question: how often are these papers used? How many times have they been downloaded?
  • 3. Enter the Robot • At least 18% of web requests are from robots • Less than half can be accounted for by the five main search engines • At Research Repository UCD, two-thirds of our downloads are marked as coming from web robots
  • 4. What are you talking about? Internet robot, Web robot, automated agent, crawler, spider, bot: any programme that visits websites and systematically retrieves information from them
  • 5. Good and bad • Good: search engines, link verifiers, computer science experiments • Bad: gathering content for spam, phishing and copycat sites, artificially improving a website’s ranking (spamdexing), looking for security holes, DDoS attacks…
  • 6. ‘And the noisy, nasty nuisance grew, ‘til the villagers cried, “What can we do?”’ Detection methods: • Blocking robots in real-time: Turing tests • Detecting later and removing from statistics
  • 7. Appropriate, but problematic methods for repositories
      • Excluding known robots by user-agent name
          – Easily faked or omitted
      • Excluding by IP address
          – DHCP, and list is growing exponentially
      • Usage pattern analysis: query rate and resources requested
          – Expensive to automate
      • Machine learning: training decision trees, neural nets and/or statistical systems
          – Did you say expensive???
      • Combined approaches
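
The first method on slide 7, excluding by user-agent name, can be sketched roughly as follows. This is only an illustration: the pattern list is hypothetical (a real deployment would use a maintained list), and treating a missing agent string as a robot is an assumption made here, not something the slides prescribe.

```python
import re

# Hypothetical, deliberately short pattern list for illustration only.
ROBOT_AGENT_PATTERNS = [r"bot", r"crawl", r"spider", r"slurp"]
_ROBOT_RE = re.compile("|".join(ROBOT_AGENT_PATTERNS), re.IGNORECASE)


def is_declared_robot(user_agent):
    """Return True if the user-agent string is missing or matches a robot pattern."""
    if not user_agent or not user_agent.strip():
        # Assumption: an omitted agent name is treated as a robot.
        return True
    return bool(_ROBOT_RE.search(user_agent))
```

As the slide notes, this only catches robots that declare themselves: a robot that sends a browser-like agent string passes straight through.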
  • 8. Effectiveness, and out-of-the-box repository strategies

      Strength of each detection strategy:

      Robots detected by            Recall (%)   Precision (%)
      No images requested                98.34           75.48
      No referring site                  96.27           52.25
      List of IP addresses               69.29           99.40
      HEAD method to access site         32.37          100.00
      Agent name declared                26.56          100.00
      Access only at night               24.48           50.43
      Robots.txt file accessed           17.01          100.00
      Time, σ (3s)                        2.49          100.00
      Time, average (1s)                  2.49           75.00

      DSpace uses IP addresses of known agents – much weaker than in the benchmarking study
  • 9. Effectiveness, and out-of-the-box repository strategies (same table as slide 8). Eprints filters based on the number of hits from an IP address per day – similar to the time-based strategies in the benchmarking study
  • 10. Effectiveness, and out-of-the-box repository strategies (same table as slide 8)
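
The recall and precision columns in the table on slide 8 can be read with a small worked example. The sketch below assumes the standard definitions (recall: share of actual robots that were flagged; precision: share of flagged sessions that really were robots); the function name and session identifiers are hypothetical.

```python
def recall_and_precision(flagged, actual_robots):
    """Return (recall %, precision %) for a set of flagged sessions."""
    flagged = set(flagged)
    actual = set(actual_robots)
    true_positives = len(flagged & actual)
    # Guard against empty sets to avoid division by zero.
    recall = 100.0 * true_positives / len(actual) if actual else 0.0
    precision = 100.0 * true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Four real robot sessions; the strategy flags three sessions, two of them correctly.
recall, precision = recall_and_precision(
    flagged={"s1", "s2", "s9"},
    actual_robots={"s1", "s2", "s3", "s4"},
)
```

Read this way, a strategy such as "List of IP addresses" (recall 69.29, precision 99.40) misses roughly three in ten robots but almost never flags a human, while "No referring site" (96.27, 52.25) catches nearly all robots at the cost of flagging many genuine users.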
  • 11. Centralised strategy: IRUS-UK
      • Collects and filters statistics from 84 DSpace and Eprints repositories
      • COUNTER compliant usage statistics
      • Robot exclusion:
          – The COUNTER list of agent names
          – All downloads from IP addresses where there are more than 200 downloads in a day from a repository
          – Most downloads from IP addresses where there are more than 100 downloads in a day from a repository
      • Work commissioned to investigate feasibility and approach to adaptive filtering based on usage behaviour
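
The "more than 200 downloads in a day" rule from slide 11 could be sketched as below. This is not IRUS-UK's code: the record layout (repository, IP address, day, item) and the function name are assumptions for illustration. The 100-download rule is left out because the slide does not say which downloads count as "most".

```python
from collections import Counter

DAILY_DOWNLOAD_LIMIT = 200  # threshold quoted on the slide


def exclude_heavy_ips(downloads, limit=DAILY_DOWNLOAD_LIMIT):
    """Drop every download from an IP that exceeds `limit` downloads
    from one repository in one day.

    `downloads` is an iterable of (repository, ip, day, item) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    downloads = list(downloads)
    # Count downloads per (repository, ip, day) combination.
    per_key = Counter((repo, ip, day) for repo, ip, day, _item in downloads)
    return [d for d in downloads if per_key[(d[0], d[1], d[2])] <= limit]
```

An IP with exactly 200 downloads in a day is kept under this reading, since the slide says "more than 200".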
  • 12. Sources by slide
      1: Bill Gosper's Glider Gun in action, a variation of Conway's Game of Life. Johan G. Bontes. <https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#/media/File:Gospers_glider_gun.gif>
      3, 6, 7: Doran, D.; Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery (2011) 22:183-210. DOI:10.1007/s10618-010-0180-z
      4: <http://pixabay.com/static/uploads/photo/2015/05/31/12/09/wooden-791421_640.jpg>
      5: Bad Robot Productions logo. 2001-2008. <https://en.wikipedia.org/wiki/Bad_Robot_Productions#/media/File:Bad_Robot_Productions_logo.jpg>
      6: Burroway, J.; Lord, J. V. The Giant Jam Sandwich. 1972, Houghton Mifflin Harcourt.
      8, 9, 10: Geens, N.; Huysmans, J.; Vanthienen, J. Evaluation of Web Robot Discovery Techniques: A Benchmarking Study. Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. Lecture Notes in Computer Science 4065, pp 121-130, 2006. DOI:10.1007/11790853_10
      8: Diggory, Mark. SOLR Statistics. DSpace Wiki. <https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics>
      9: Joint, Nicholas. [EP-tech] Re: Please change the way IRstats works. Eprints_tech mailing list, 2011-10-13. <http://www.eprints.org/tech.php/15695.html>
      11: IRUS-UK. <http://www.irus.mimas.ac.uk/participants/>