UCD Library (Leabharlann UCD)
University College Dublin (An Coláiste Ollscoile, Baile Átha Cliath),
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
– (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
– This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
– The data is real, live data, and the algorithms were very easy to simulate
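The sampling step above can be sketched as follows. This is a minimal illustration, not the study's own procedure: the record ids, the seed, the 95% z-value of 1.96 and the normal-approximation formula are all assumptions made for the example.

```python
import math
import random


def simple_random_sample(records, n, seed=0):
    """Draw n records without replacement; every record is equally likely."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    return rng.sample(records, n)


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion p.

    z=1.96 corresponds to a 95% confidence level and is an assumption here.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical stand-in for the download log: one integer id per download.
downloads = list(range(10_000))
sample = simple_random_sample(downloads, 341)

# Margin of error for an observed robot share of 0.85 in a sample of 341.
moe = margin_of_error(0.85, 341)
```

Each sampled record would then be classified by hand as robot or human before any algorithm is run against it.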
First finding
85% of unfiltered repository downloads come from robots
• Confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% to be robots
Catching more robots improves stats
(But how much depends on the number of robots)
[Chart: accuracy of download stats (inverse precision), 0 to 1, plotted against recall (robots), 0 to 1. Axis annotations: catch more robots, get better stats. One curve per traffic profile: typical website, 15% robot traffic; OA journal, 40% robot; Internet Archive, 91% robot; OA repositories, 85% robot.]
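Why the robot share matters so much can be sketched with a simplified model. The formula below is an assumption for illustration: it supposes that no human download is ever wrongly excluded, so the reported stats contain all human downloads plus whatever robots were missed. The function name is hypothetical; the four robot shares are the ones named on the slide.

```python
def stats_accuracy(robot_share, recall):
    """Share of reported downloads that are human, under the simplifying
    assumption that no human download is wrongly excluded."""
    humans = 1.0 - robot_share
    robots_missed = robot_share * (1.0 - recall)
    reported = humans + robots_missed
    if reported == 0:
        return float("nan")  # nothing left to report: accuracy is undefined
    return humans / reported


# Robot shares taken from the slide.
profiles = {
    "typical website": 0.15,
    "OA journal": 0.40,
    "OA repositories": 0.85,
    "Internet Archive": 0.91,
}

for name, share in profiles.items():
    row = [round(stats_accuracy(share, r), 2) for r in (0.0, 0.5, 0.9, 1.0)]
    print(f"{name:18s} {row}")
```

Under this model a site with little robot traffic has fairly accurate stats even at low recall, while a repository with 85% robot traffic only gets accurate stats as recall approaches 1.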
Robot detection techniques used
Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on
Rate of requests: ✓(3)
User agent string: ✓ ✓ ✓
robots.txt access: ✓
Volume of requests: ✓(2) ✓(3)
List of known robot IP addresses: ✓ ✓
Reverse DNS name lookup: ✓(1)
Trap file: ✓
User agents per IP address: (none)
Width of traversal in the URL space: ✓(3)
(1) Only implemented nominally or experimentally
(2) Via the repeat download or ‘double-click’ filter
(3) Data available as a configurable report for manual decision making
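Two of the techniques in the table, user agent string matching and robots.txt access, can be sketched in a few lines. This is a minimal sketch and not the code of DSpace, EPrints or the Minho add-on: the pattern list is deliberately tiny and illustrative, treating an empty user agent as a robot is an assumption, and the function names are hypothetical.

```python
import re

# Illustrative patterns only; production lists are far longer.
ROBOT_AGENT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def looks_like_robot(user_agent):
    """Flag a request whose user agent is empty or matches a robot pattern."""
    if not user_agent:
        return True  # assumption: a missing user agent is treated as a robot
    return bool(ROBOT_AGENT_PATTERN.search(user_agent))


def ips_requesting_robots_txt(requests):
    """Return the IPs that fetched /robots.txt.

    `requests` is an iterable of (ip, path) tuples.
    """
    return {ip for ip, path in requests if path == "/robots.txt"}
```

An IP that fetches robots.txt can then have all of its downloads discounted, not only the requests that carry a recognisable user agent.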
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
– I can haz robot?
• Precision: proportion of detections that are true positives
– Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats measured as inverse precision:
– Proportion of reported downloads that are actually made by humans
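The measurements above can be computed from four counts, treating "robot" as the positive class. The counts in the example are hypothetical and are not results from the study; the function and key names are likewise chosen for illustration.

```python
def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def detection_metrics(tp, fp, tn, fn):
    """Robot-detection measures, with 'robot' as the positive class.

    tp: robots flagged as robots      fp: humans flagged as robots
    tn: humans counted in the stats   fn: robots counted in the stats
    """
    n = tp + fp + tn + fn
    return {
        "recall": _ratio(tp, tp + fn),             # robots detected
        "precision": _ratio(tp, tp + fp),          # excluded downloads that are robots
        "inverse_recall": _ratio(tn, tn + fp),     # human downloads kept in the stats
        "inverse_precision": _ratio(tn, tn + fn),  # reported downloads that are human
        "overall_accuracy": _ratio(tp + tn, n),
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=80, fp=5, tn=45, fn=20)
```

With these made-up counts recall is 0.8 but inverse precision is only about 0.69, because the 20 missed robots sit alongside just 45 human downloads in the reported stats.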
How they perform, out-of-the-box
[Chart: Robot detection in OA IR systems. Bars for DSpace, EPrints, Minho, Minho with monthly manual checking, and no robot detection. Series: recall; precision; negative precision (accuracy of download stats). Scale 0 to 1.]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
– Daily downloads for the last 2-4 months
– Top 10 most downloaded items
– Top 20 downloading IP addresses for the last 2-4 months
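The three monthly reports can be sketched with simple counters. This is an illustrative sketch, not UCD's actual tooling: the event layout of (date, ip, item_id) tuples and the function names are assumptions.

```python
from collections import Counter


def daily_downloads(events):
    """Count downloads per day; events are (date, ip, item_id) tuples."""
    return Counter(date for date, _, _ in events)


def top_items(events, limit=10):
    """Most downloaded items, as (item_id, count) pairs."""
    return Counter(item for _, _, item in events).most_common(limit)


def top_downloaders(events, limit=20):
    """IP addresses with the most downloads, as (ip, count) pairs."""
    return Counter(ip for _, ip, _ in events).most_common(limit)


# Hypothetical events; addresses come from a documentation-only range.
events = [
    ("2016-05-01", "192.0.2.1", "item-a"),
    ("2016-05-01", "192.0.2.1", "item-b"),
    ("2016-05-02", "192.0.2.1", "item-a"),
    ("2016-05-02", "192.0.2.7", "item-a"),
]
```

A person then reviews the lists for outliers, such as a single address responsible for a large share of a month's downloads, and excludes them.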
[Charts: robots caught (recall) for DSpace, EPrints and Minho; accuracy of reported download stats (inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with manual checking (outlier exclusion). Scale 0 to 1.]
2. Recalibrate the EPrints repeat-download (double-click) filter
[Chart: Effect of double-click filter on EPrints’ robot detection and stats. Measures: recall (robots); precision (accuracy of excluded downloads); inverse recall (legitimate downloads accounted for in stats); inverse precision (accuracy of reported download stats); overall accuracy, $\frac{T_p + T_n}{n}$. Series: without double-click filter; with double-click filter (out-of-the-box); with recalibrated double-click filter*. Scale 0 to 1.]
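A repeat-download filter of the kind discussed here can be sketched as below. This is not the EPrints implementation: the event layout, the function name and the window values are assumptions, and the slides do not state the recalibrated setting. In this sketch the window slides, meaning it is measured from the most recent request by the same IP for the same item, whether or not that request was counted.

```python
from datetime import datetime, timedelta


def filter_repeat_downloads(events, window):
    """Keep a download only if the same IP has not requested the same item
    within `window` of its previous request.

    events: iterable of (timestamp, ip, item_id) tuples, in any order.
    """
    last_seen = {}
    kept = []
    for ts, ip, item in sorted(events):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window:
            kept.append((ts, ip, item))
        last_seen[key] = ts  # sliding window: every request resets the clock
    return kept


# Hypothetical events from one address for one item.
t0 = datetime(2016, 6, 14, 11, 0, 0)
events = [
    (t0, "192.0.2.1", "item-a"),
    (t0 + timedelta(seconds=10), "192.0.2.1", "item-a"),
    (t0 + timedelta(hours=2), "192.0.2.1", "item-a"),
]

short_window = filter_repeat_downloads(events, timedelta(hours=1))
long_window = filter_repeat_downloads(events, timedelta(hours=24))
```

Changing the window is what "recalibrating" amounts to in this sketch: a longer window discounts more repeat traffic, at the risk of also discounting legitimate return visits.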
3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
– Similar to EPrints' $is_robot variable in Robots.pm
– Could be modified to update the DSpace 'isBot' field in the Solr usage events document
• Requires 2 database tables to store learned agents and IPs
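The input side of such a log parser can be sketched as follows. This is not Minho's Java class; it is a small Python illustration of reading one line in Apache Combined Log Format. The regular expression, the function name and the sample line (including its path and user agent) are assumptions, and the pattern does not handle escaped quotes inside fields.

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


# Hypothetical log line for illustration.
sample_line = (
    '192.0.2.10 - - [14/Jun/2016:11:00:00 +0100] '
    '"GET /bitstream/10197/1234/1/paper.pdf HTTP/1.1" 200 52344 '
    '"-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"'
)
parsed = parse_combined_line(sample_line)
```

The ip and agent fields are what a detector would check against the stored tables of learned agents and IPs before setting the robot flag on the matching usage event.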
[Charts: robots caught (recall) for DSpace, EPrints and Minho; accuracy of reported download stats (inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with Minho log parser. Scale 0 to 1.]
4. Combine two or more techniques
[Chart: robots caught (recall) for DSpace, EPrints and Minho. Series: out-of-the-box; with manual checking (outlier exclusion); with recalibrated double-click filter*; with Minho log parser; with Minho and outliers; Minho, outliers, and recalibrated double-click*. Scale 0 to 1.]
[Chart: accuracy of reported download stats (inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with manual checking (outlier exclusion); with recalibrated double-click filter*; with Minho log parser; with Minho and outliers; Minho, outliers, and recalibrated double-click*. Scale 0 to 1.]
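One simple way to combine techniques is to flag a download as a robot when any single detector flags it. The sketch below assumes that rule; the slides do not say how the techniques were combined in the experiment. The detector functions, the event layout and the address list are hypothetical.

```python
def combine_detectors(*detectors):
    """Return a detector that flags an event if any component detector does."""
    def combined(event):
        return any(detector(event) for detector in detectors)
    return combined


# Hypothetical component detectors; each takes a dict describing one download.
KNOWN_ROBOT_IPS = {"192.0.2.50"}  # illustrative address from a documentation range


def by_agent(event):
    """Flag events whose user agent contains 'bot' (illustrative rule)."""
    return "bot" in event["agent"].lower()


def by_ip(event):
    """Flag events from an address on the known-robot list."""
    return event["ip"] in KNOWN_ROBOT_IPS


is_robot = combine_detectors(by_agent, by_ip)
```

Combining by "any" can only raise recall, since every robot caught by one detector is still caught. It can lower precision, because each detector's false positives are carried over too.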
Thank you!

 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 

Recently uploaded (20)

Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 

#iCanHazRobot?: improved robot detection for IR usage statistics

  • 1. Leabharlann UCD An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire UCD Library University College Dublin, Belfield, Dublin 4, Ireland Joseph Greene Research Repository Librarian University College Dublin joseph.greene@ucd.ie http://researchrepository.ucd.ie #iCanHazRobot? Improved robot detection for IR usage statistics Open Repositories 2016 Dublin, 14 June
  • 2. Overview and take-home points • Usage stats are important – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm) • Robot filtration is a problem, especially in repositories • Robot detection has an exponential effect on usage stats’ accuracy in repositories • 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
  • 3. Experimental study • Simple random sample of 2 years of UCD repository’s download data – n=341, N=3.3 million; 96.20% certainty • Manually checked to determine if robot or human • Applied DSpace, EPrints robot detection algorithms to the dataset – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs – The data is real, live data, and the algorithms were very easy to simulate
  • 4. First finding 85% of unfiltered repository downloads come from robots • This is confirmed in a 2013 IRUS-UK white paper on 20 IRs; 85% was also found to be robots
  • 5. Catching more robots improves stats (but how much depends on the number of robots). [Figure: accuracy of download stats (inverse precision) plotted against recall (robots), both on a 0 to 1 scale, with one curve each for a typical website (15% robot traffic), an OA journal (40% robot), OA repositories (85% robot) and the Internet Archive (91% robot). Axis annotations: "Get better stats", "Catch more robots".]
  • 6. Robot detection techniques used. Columns: DSpace, EPrints, Minho DSpace Statistics Add-on. Rate of requests: ✓[3]. User agent string: ✓ ✓ ✓. robots.txt access: ✓. Volume of requests: ✓[2] ✓[3]. List of known robot IP addresses: ✓ ✓. Reverse DNS name lookup: ✓[1]. Trap file: ✓. User agents per IP address. Width of traversal in the URL space: ✓[3]. Notes: [1] Only implemented nominally or experimentally. [2] Via the repeat download or 'double-click' filter. [3] Data available as a configurable report for manual decision making.
  • 7. Measurements used in robot detection • All measurements are a number between 0 and 1 • Recall: proportion of robots detected – I can haz robot? • Precision: true positives in robot detection – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots) • Accuracy of download stats measured as inverse precision: – Proportion of stats that are actually made by humans
  • 8. How they perform, out-of-the-box. [Figure: "Robot detection in OA IR systems", a bar chart on a 0 to 1 scale showing recall, precision and negative precision (accuracy of download stats) for DSpace, EPrints, Minho, Minho with monthly manual checking, and no robot detection.]
  • 10. 1. Ability to manually check for outliers • At UCD, once a month, we check: – Daily downloads for the last 2-4 months – Top 10 most downloaded items – Top 20 downloading IP addresses for the last 2-4 months
  • 15. [Figure, two panels on a 0 to 1 scale. One panel: robots caught (recall) for DSpace, EPrints and Minho. The other panel: accuracy of reported download stats (inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with manual checking (outlier exclusion).]
  • 16. 2. Recalibrate the EPrints repeat-download (double-click) filter. [Figure: "Effect of double-click filter on EPrints' robot detection and stats", on a 0 to 1 scale. Measures: recall (robots); precision (accuracy of excluded downloads); inverse recall (legitimate downloads accounted for in stats); inverse precision (accuracy of reported download stats); overall accuracy, $\frac{T_p + T_n}{n}$. Series: without double-click filter; with double-click filter (out-of-the-box); with recalibrated double-click filter*.]
  • 17. 3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints • 1 Java class • Input is Apache Combined Log Format • Output is a database update (robot = true field) – Similar to EPrints' $is_robot variable in Robots.pm, – Could be modified to update the DSpace 'isBot' field in the SOLR usage events document • Requires 2 database tables to store learned agents and IPs
  • 18. [Figure, two panels on a 0 to 1 scale. One panel: robots caught (recall) for DSpace, EPrints and Minho. The other panel: accuracy of reported download stats (inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with Minho log parser.]
  • 19. 4. Combine two or more techniques. [Figure: robots caught (recall), on a 0 to 1 scale, for DSpace, EPrints and Minho. Series: out-of-the-box; with manual checking (outlier exclusion); with recalibrated double-click filter*; with Minho log parser; with Minho and outliers; Minho, outliers, and recalibrated double-click*.]
  • 20. 4. Combine two or more techniques. [Figure: accuracy of reported download stats (inverse precision), on a 0 to 1 scale, for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with manual checking (outlier exclusion); with recalibrated double-click filter*; with Minho log parser; with Minho and outliers; Minho, outliers, and recalibrated double-click*.]
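
The measurements defined on slide 7 can be sketched in a few lines of Python. This is only an illustration, not the study's code: the `Counts` class, the function names and the example counts are all hypothetical. The last function encodes the formula quoted in the editor's notes (Greene 2016), on the assumption that the flattened expression there is a single fraction; the check at the end suggests that reading agrees with the confusion-matrix definition.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts, treating 'robot' as the positive class."""
    tp: int  # robot downloads flagged as robots
    fp: int  # human downloads wrongly flagged as robots
    tn: int  # human downloads kept in the stats
    fn: int  # robot downloads that slipped into the stats


def recall(c: Counts) -> float:
    # Proportion of robots detected.
    return c.tp / (c.tp + c.fn)


def precision(c: Counts) -> float:
    # Proportion of discounted downloads that really were robots.
    return c.tp / (c.tp + c.fp)


def inverse_precision(c: Counts) -> float:
    # Proportion of reported stats that really were humans.
    return c.tn / (c.tn + c.fn)


def overall_accuracy(c: Counts) -> float:
    # (Tp + Tn) / n, as on slide 16.
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def inverse_precision_from_rates(t: float, r: float, p: float) -> float:
    """Formula from the notes, read as numerator over denominator (assumption).

    t = ratio of robots to total, r = recall, p = precision.
    """
    numerator = t * r * (r - p * r - 1) + 2 * t * p * r - p * (t + r - 1)
    denominator = r * (t * r - p - t) + p
    return numerator / denominator


# Hypothetical counts, not the study's data: 1000 downloads, 850 from robots.
example = Counts(tp=680, fp=20, tn=130, fn=170)
t = (example.tp + example.fn) / 1000

direct = inverse_precision(example)
via_formula = inverse_precision_from_rates(t, recall(example), precision(example))
print(round(direct, 4), round(via_formula, 4))
```

Written this way, the expression becomes 0/0 when recall is exactly 1, so a practical version would need to special-case perfect recall.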

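Slide 17 describes a log parser whose input is Apache Combined Log Format and whose output is a robot flag. The following Python sketch shows what that shape of program might look like; it is not Minho's Java class. The hint list, the function names and the sample log lines are assumptions made for illustration, and only two of the techniques from slide 6 (user agent string, robots.txt access) are shown.

```python
import re
from typing import Optional

# Apache Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative substrings only (assumption); a production list would be longer.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider")


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def looks_like_robot(entry: dict) -> bool:
    """Flag an entry by user agent string or by a request for robots.txt."""
    parts = entry["request"].split()
    path = parts[1] if len(parts) > 1 else ""
    agent = entry["agent"].lower()
    return path == "/robots.txt" or any(hint in agent for hint in ROBOT_AGENT_HINTS)


# Made-up log lines using documentation-range IP addresses.
sample_robot = (
    '192.0.2.1 - - [14/Jun/2016:11:00:00 +0100] '
    '"GET /bitstream/1/paper.pdf HTTP/1.1" 200 1024 "-" "ExampleBot/1.0"'
)
sample_human = (
    '192.0.2.2 - - [14/Jun/2016:11:00:05 +0100] '
    '"GET /bitstream/1/paper.pdf HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; rv:47.0) Gecko/20100101 Firefox/47.0"'
)

for line in (sample_robot, sample_human):
    entry = parse_line(line)
    print(entry["ip"], looks_like_robot(entry))
```

In a real port, the boolean would be written back to the statistics store (the robot = true field mentioned on the slide) rather than printed.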
Editor's Notes

  1. Good news: DSpace and EPrints do robot filtration out-of-the-box. Bad news: the stats are still quite inaccurate. More good news: improving robot recall has an exponential effect on usage stats accuracy. Usage stats are primarily download counts. They are used heavily in marketing the repository, and they provide a measure of ROI both to those who have uploaded to it (an investment of time and effort) and to those who fund the repository. More downloads = more UCD visibility, which is one measure of our ROI.
  2. Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million for 96.20% certainty), manually checked to determine if robot or human. System: DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4; Apache Tomcat behind Apache HTTP server; logs in Apache Combined Log Format. Minho registers every download in the PostgreSQL database. Results to be published in the July 2016 issue of Library Hi Tech (Greene 2016). This dataset is used to experimentally test different detection techniques, alone and in combination. Weaknesses: the data is taken from a DSpace/Minho system (its own SEO, its own way of being crawled, etc.). 'In vitro': except for the original system (DSpace/Minho + monthly manual outlier checking), the robot detection techniques are simulated, hence EXPERIMENTAL. Strengths: 'in vivo': the data is real data from a production OA IR system. Simulating the various detection techniques was very easy to do, so the result is probably a very accurate picture of how each system would have treated this dataset.
  3. See: INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. Confirms the 85% figure. DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. Hypothesizes why the figure is so high in OA (p. 191).
  4. Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies. OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741. OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above). Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348. Reverse is also true: fail to catch robots (e.g. deterioration over time as robots improve their capabilities), accuracy of stats diminishes. Formula (Greene 2016): $P_{inv} = \frac{TR(R - PR - 1) + 2TPR - P(T + R - 1)}{R(TR - P - T) + P}$, where R = recall (robot detection), P = precision (robot detection), $P_{inv}$ = inverse precision (human stats), T = ratio of robots to total.
  5. Greene 2016
  6. Minho with monthly manual checking is the original data as measured in vivo. Minho alone has detected manual outliers removed. DSpace and EPrints have been generated by applying their native algorithms to the data.
  7. Outliers: cf. LAMOTHE, A. R. 2014. The importance of identifying and accommodating e-resource usage data for the presence of outliers. Information Technology and Libraries, 33, 31-44.
  8. *Recalibrated double-click filter: a single IP address downloading a single item more than 10 times in 24 hours is excluded. By default the filter is 1 IP, downloads 1 item more than 1 time in 24 hours. This can be configured in terms of the timeout length but currently can't be configured to increase the number of downloads allowed within the period. See also: JOINT, N., FIELD, A. & GREGSON, M. 2011. Please change the way IRstats works [Online]. Available: http://www.eprints.org/tech.php/15695.html [Accessed November 28 2015]. The drop in inverse recall (loss of legitimate downloads) supports the concern raised in this email discussion. However, if the recalibration were to be implemented, the improvement to robot precision means that the increase in legitimate downloads is offset by the decrease in illegitimate ones, so inverse precision is not affected a great deal. Overall accuracy improves notably, however.
  9. *Recalibrated double-click filter: a single IP address downloading a single item more than 10 times in 24 hours is excluded. By default the filter is 1 IP, downloads 1 item more than 1 time in 24 hours. This can be configured in terms of the timeout length but currently can't be configured to increase the number of downloads allowed within the period
  10. *Recalibrated double-click filter: a single IP address downloading a single item more than 10 times in 24 hours is excluded. By default the filter is 1 IP, downloads 1 item more than 1 time in 24 hours. This can be configured in terms of the timeout length but currently can't be configured to increase the number of downloads allowed within the period
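
The recalibrated filter described in the notes above can be sketched as follows. This is a hedged illustration rather than EPrints' own implementation: the function name, the tuple layout and the sample data are hypothetical. It also assumes that only downloads beyond the allowance are excluded, and that the 24-hour window is measured from the oldest download still counted; the notes do not specify either point.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def filter_repeat_downloads(events, max_per_window=10, window=timedelta(hours=24)):
    """Split (ip, item_id, timestamp) events into (kept, excluded) lists.

    A download is excluded when the same IP has already been credited with
    max_per_window downloads of the same item inside the window.
    max_per_window=1 corresponds to the default behaviour described in the
    notes; 10 corresponds to the recalibrated filter.
    """
    recent = defaultdict(deque)  # (ip, item) -> timestamps of kept downloads
    kept, excluded = [], []
    for ip, item, ts in sorted(events, key=lambda event: event[2]):
        timestamps = recent[(ip, item)]
        # Forget kept downloads that have aged out of the window.
        while timestamps and ts - timestamps[0] >= window:
            timestamps.popleft()
        if len(timestamps) >= max_per_window:
            excluded.append((ip, item, ts))
        else:
            timestamps.append(ts)
            kept.append((ip, item, ts))
    return kept, excluded


# Made-up data: one IP downloads the same item 12 times, one minute apart.
start = datetime(2016, 6, 14, 11, 0)
events = [("192.0.2.1", "item-1", start + timedelta(minutes=i)) for i in range(12)]
events.append(("192.0.2.2", "item-1", start))

kept, excluded = filter_repeat_downloads(events, max_per_window=10)
print(len(kept), len(excluded))
```

Making the allowance a parameter is the point of the sketch: the notes say the timeout is configurable but the number of downloads allowed is not.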