Robots Still Outnumber Humans in Web
Archives, But Less Than Before
Himarsha R. Jayanetti (1), Kritika Garg (1), Sawood Alam (2), Michael L. Nelson (1), and Michele C. Weigle (1)
(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk VA, USA, @WebSciDL
(2) Wayback Machine, Internet Archive, San Francisco, California, USA, @internetarchive
Presented By:
Himarsha R. Jayanetti
Department of Computer Science
Old Dominion University, Norfolk, Virginia
@HimarshaJ @WebSciDL @oducs
TPDL ‘22, The 26th International Conference on Theory and Practice of Digital Libraries, Padua, Italy, 20 - 23 September 2022
Robots Still Outnumber Humans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL
A Scenario in Which a Human Accesses Web Archives
https://web.archive.org/web/20120313134227/http://www.lib.odu.edu/exhibits/odu75thanniversary/norfolkdivision.htm
https://en.wikipedia.org/wiki/Old_Dominion_University#References
Another Scenario Is When Bot Services Query Web Archives
TimeMap Visualization Tool (TMVis): https://github.com/oduwsdl/tmvis
TMVis Visualizes How Individual Webpages Have Changed Over Time
[Figure: TMVis output views (Slider, GIF) built from a TimeMap]
TimeMap: https://web.archive.org/web/20220000000000*/http://4genderjustice.org/
Our Study Is an Extension of a Previous Study
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, July 2013, pp. 339-348. https://doi.org/10.1145/2467696.2467722
● Robots outnumber humans:
○ 10:1 (sessions)
○ 5:4 (raw HTTP accesses)
○ 4:1 (megabytes transferred)
● Robots almost always access TimeMaps, but humans access
the mementos.
● No overall preference for mementos of a particular time, but
the recent past (within the last year) shows significant repeat
accesses.
● Proposed access patterns of web archive users:
○ Dip
○ Slide
○ Dive
○ Skim
Access Patterns of Web Archive Users: Dip and Dive
[Figure: Dip and Dive access patterns across original resources URI-R1, URI-R2, and URI-R3]
Access Patterns of Web Archive Users: Slide and Skim
We Used Three Full-Day Access Log Datasets
Full-day samples of access logs:
● IA 2012: Internet Archive (https://web.archive.org/), February 2, 2012
● IA 2019: Internet Archive (https://web.archive.org/), February 7, 2019
● PT 2019: Arquivo.pt (https://arquivo.pt/), February 7, 2019

Feature               IA 2012       IA 2019        PT 2019
No. of Requests       99,173,542    308,194,916    1,046,855
GET                   98.80%        98.68%         97.92%
HEAD                  1.12%         0.84%          1.37%
Status Code 2xx       32.73%        48.26%         26.03%
Status Code 3xx       52.57%        42.74%         20.22%
Status Code 4xx       11.71%        8.79%          53.58%
Status Code 5xx       2.99%         0.20%          0.17%
Embedded Resources    43.62%        63.36%         19.68%
SI Bot                0.01%         0.15%          0.34%
We Followed a Two-Step Data Cleaning Process
Stage 1
● Remove log entries that were either invalid or irrelevant to the analysis.
○ Kept only requests to mementos, requests to TimeMaps, and requests to the robots.txt of the web archive; removed everything else.
Stage 2
● Remove log entries that were irrelevant in terms of user behavior.
○ Everything except GET requests
○ Everything except 200, 404, and 503 response codes
○ Embedded resources

Dataset    Before (No. of Requests)    After Cleaning, Stage 1    After Cleaning, Stage 2
IA 2012    99,173,542                  85.22%                     18.58%
IA 2019    308,194,916                 77.19%                     11.36%
PT 2019    1,046,855                   86.40%                     57.77%
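The Stage 2 rules can be sketched as a single predicate over one log line. This is a minimal illustration, not the study's code: the regular expression assumes a Common/Combined Log Format request field, and `EMBEDDED_EXT` is a hypothetical extension list standing in for however embedded resources were actually identified.

```python
import re

# Hypothetical list of extensions treated as embedded resources (assumption).
EMBEDDED_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg", ".woff")
# Response codes kept in Stage 2, as listed on the slide.
KEPT_STATUS = {200, 404, 503}

# Matches the quoted request field and the status code that follows it.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3})')


def keep_for_stage2(line: str) -> bool:
    """Return True if a log line survives the Stage 2 cleaning rules."""
    match = REQUEST_RE.search(line)
    if match is None:
        return False  # unparseable lines are dropped in this sketch
    if match.group("method") != "GET":
        return False
    if int(match.group("status")) not in KEPT_STATUS:
        return False
    # Ignore the query string when checking the file extension.
    path = match.group("url").split("?", 1)[0].lower()
    return not path.endswith(EMBEDDED_EXT)
```

Applied to a log file, this would be used as `kept = [line for line in lines if keep_for_stage2(line)]`.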
We Divided the Access Logs Into Different Sessions
○ Grouped the requests based on the IP and User-Agent.
○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes)
1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
The IP addresses
are anonymized.
The gap between the second and third requests (04:36:44 to 04:58:34) is greater than 10 minutes, so the third request starts a new session.
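The session splitting described above (group by IP and User-Agent, then cut whenever the gap between consecutive requests exceeds 10 minutes) might be sketched as follows. The function names and the tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)


def parse_ts(text):
    """Parse an access-log timestamp such as '02/Feb/2012:04:36:43 +0000'."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


def split_sessions(requests):
    """requests: iterable of (ip, user_agent, timestamp) tuples.

    Returns {(ip, user_agent): [session, ...]}, where each session is a
    chronologically ordered list of timestamps.
    """
    by_user = defaultdict(list)
    for ip, user_agent, ts in requests:
        by_user[(ip, user_agent)].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > SESSION_TIMEOUT:
                user_sessions.append([cur])  # gap too long: start a new session
            else:
                user_sessions[-1].append(cur)
        sessions[user] = user_sessions
    return sessions
```

With the six timestamps from the log excerpt, this yields one session of two requests followed by one session of four requests for that user.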
Heuristic: The Type of Request (HEAD Requests)
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
● Web browsers issue GET requests for web pages.
● We flagged HEAD requests as bot requests.
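As a small illustration, the check reduces to inspecting the HTTP method in the request field. The function name is hypothetical, and the input is assumed to be the already-extracted request string from a log line.

```python
def is_head_request(request_field: str) -> bool:
    """request_field is the quoted request from a log line,
    e.g. 'HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1'."""
    method = request_field.split(" ", 1)[0]
    return method.upper() == "HEAD"
```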
Heuristic: List of Known Bots
● A manually compiled User-Agent list of known bots.
● User-Agents with keywords such as bot, crawler, spider, etc.
● The Python module "DeviceDetector", a User-Agent parser that helps determine whether or not a User-Agent is a bot.
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
https://github.com/oduwsdl/access-patterns/blob/main/Known_Bot_List/knownbot.tsv (Final Known Bot List)
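The keyword and known-list checks could look roughly like the sketch below. It covers only the first two bullets; the DeviceDetector step is deliberately left out, and `known_bots` stands in for the compiled list (its contents here are whatever the caller passes in, not the published file).

```python
import re

# Keywords named on the slide; the "etc." in the source is not expanded here.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def looks_like_bot(user_agent: str, known_bots=frozenset()) -> bool:
    """True if the User-Agent is on the known-bot list or contains a bot keyword."""
    if user_agent in known_bots:
        return True
    return BOT_KEYWORDS.search(user_agent) is not None
```

For example, `looks_like_bot("Twitterbot/1.0")` matches on the keyword alone, while a User-Agent such as "RSS Scout 0.9.2" is caught only if it appears in `known_bots`.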
Heuristic: Number of User-Agents per IP
x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
00101000
x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
. . .
. . .
. . .
x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)"
00101000
x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
00101000
x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
● Some bots keep changing their User-Agent between requests to avoid being detected as bots.
● We flagged requests from IPs that changed their User-Agent field more than 20 times as bots.
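One way to express this rule is to count, per IP, how often consecutive requests carry a different User-Agent. This is a sketch under an assumption: "update" is read here as a change between consecutive requests, whereas counting distinct User-Agents per IP would be an alternative reading. Function and variable names are illustrative.

```python
from collections import defaultdict

UA_CHANGE_THRESHOLD = 20  # threshold stated on the slide


def ips_with_rotating_user_agents(requests, threshold=UA_CHANGE_THRESHOLD):
    """requests: iterable of (ip, user_agent) pairs in chronological order.

    Returns the set of IPs whose User-Agent changed more than `threshold` times.
    """
    last_ua = {}
    changes = defaultdict(int)
    for ip, user_agent in requests:
        if ip in last_ua and last_ua[ip] != user_agent:
            changes[ip] += 1
        last_ua[ip] = user_agent
    return {ip for ip, count in changes.items() if count > threshold}
```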
Heuristic: Requests to robots.txt File
● Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
● We treated a request for the robots.txt file as an indication of a bot.
0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout
0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385
"http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416
"http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2"
00001000
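A minimal sketch of this check, assuming the request URL has already been extracted from the log line. It matches only the archive's own /robots.txt, not an archived copy of some other site's robots.txt; that distinction is an assumption of this illustration, as are the function names.

```python
from urllib.parse import urlsplit


def requests_robots_txt(url: str) -> bool:
    """True if the request targets the web archive's own /robots.txt."""
    return urlsplit(url).path == "/robots.txt"


def session_requests_robots_txt(urls) -> bool:
    """True if any request in the session asks for robots.txt."""
    return any(requests_robots_txt(url) for url in urls)
```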
Heuristic: Image to HTML ratio
● Image-to-HTML is the ratio between the number of image files and the number of HTML files requested per session.
● Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
● We flagged a session with fewer than one image file for every 10 HTML files as a robot session.
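The threshold can be written without floating-point division, as in the sketch below. How image and HTML files are counted per session is not shown, and leaving sessions with no HTML requests unflagged is an assumption of this illustration.

```python
def is_robot_by_image_ratio(image_count: int, html_count: int) -> bool:
    """Flag a session with fewer than one image per 10 HTML files.

    image_count / html_count < 1/10 is rewritten as an integer comparison.
    """
    if html_count == 0:
        return False  # ratio undefined; left unflagged in this sketch (assumption)
    return image_count * 10 < html_count
```

For example, a session with 20 HTML files and 1 image is flagged, while a session with 10 HTML files and 1 image sits exactly on the threshold and is not.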
[Figure: the memento http://web.archive.org/web/20220512060725/https://www.odu.edu/ as downloaded using cURL and as accessed in a web browser]
Heuristic: Browsing Speed
● We flagged sessions with a browsing speed of 0.5 requests per second or more (>= 0.5) as robot sessions.
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1"
200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1"
200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302
. . .
. . .
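A sketch of the browsing-speed check for one session follows. Two choices are assumptions made for this illustration and are labeled in the code: single-request sessions get a speed of 0, and sessions whose requests all share the same second are treated as infinitely fast.

```python
from datetime import datetime

SPEED_THRESHOLD = 0.5  # requests per second, as stated on the slide


def browsing_speed(timestamps):
    """Requests per second over one session's list of datetime objects."""
    if len(timestamps) < 2:
        return 0.0  # assumption: a single request carries no speed signal
    duration = (max(timestamps) - min(timestamps)).total_seconds()
    if duration == 0:
        return float("inf")  # assumption: several requests within the same second
    return len(timestamps) / duration


def is_robot_by_speed(timestamps) -> bool:
    return browsing_speed(timestamps) >= SPEED_THRESHOLD
```

In the log excerpt above, every request carries the timestamp 00:46:30, so that session would be flagged under this sketch.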
Results of Applying the Heuristics Separately to Detect Bots
Heuristics             IA 2012                 IA 2019                 PT 2019
                       Sessions    Requests    Sessions    Requests    Sessions    Requests
                       (1.53M)     (22.3M)     (2.7M)      (42.9M)     (3.7k)      (614k)
Known Bots             1.0%        1.0%        12.0%       12.0%       24.0%       11.0%
#UA per IP             0.3%        3.0%        0.2%        3.4%        0.1%        0.4%
Robots.txt             0.1%        0.1%        0.4%        0.1%        11.0%       0.7%
Image to HTML ratio    87.0%       89.0%       66.0%       56.0%       79.0%       96.0%
Browsing Speed         16.0%       20.0%       19.0%       49.0%       46.0%       26.0%
Total Robots           88.0%       91.0%       70.0%       70.0%       97.0%       98.0%

The percentage of sessions/requests labeled as robots by each heuristic separately.
Image-to-HTML Ratio Had the Largest Effect on
Detecting Robots
Total Number of Detected Bots After Applying All the
Heuristics Together
The "Total Robots" row gives the percentage of sessions/requests detected as robots after applying all the heuristics together.
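One plausible reading of "applying all the heuristics together" is a union: a session counts as a robot if any heuristic flags it. The sketch below computes the resulting share under that assumption; the source does not spell out the combination rule, and the dictionary layout is hypothetical.

```python
def total_robot_share(flags_per_session):
    """flags_per_session: list of dicts mapping heuristic name -> bool.

    Returns the fraction of sessions flagged by at least one heuristic
    (union of heuristics; an assumption, not confirmed by the source).
    """
    if not flags_per_session:
        return 0.0
    robots = sum(1 for flags in flags_per_session if any(flags.values()))
    return robots / len(flags_per_session)
```

Under a union, the total can never fall below the share flagged by the strongest single heuristic.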
Potential Reasons for the Increase in Human Sessions in 2019 Compared to 2012
Increase in awareness of web archives among human users in recent years
Increase in the popularity of headless browsers and automation tools such as Headless Chromium, PhantomJS, Selenium, and Puppeteer in recent years
In IA2012, the Robots Were Almost
Exclusively Limited To Dip and Skim
The Majority of Requests Are for Mementos Around
the Time Each Access Log Sample Was Taken.
Key Takeaways
● The percentage of web archive accesses that were detected as robots:

Dataset    Out of requests    Out of sessions
IA 2012    91%                88%
IA 2019    70%                70%
PT 2019    98%                97%

● In IA2012, the robots were almost exclusively limited to Dip and Skim, but in IA2019 they exhibit all of the patterns and their combinations.
● The majority of the requests are for mementos that are close to the datetime of each log sample.

Dataset/Feature    AlNoamany et al.    Our Study
Sample Duration    30 minutes          24 hours
Web Archives       Internet Archive    Internet Archive & Arquivo.pt
Access Log Year    2012                2012 & 2019

Himarsha R. Jayanetti
hjaya002@odu.edu
@HimarshaJ
Backup slides …
An Overview of Our Approach and the Steps Followed
32

Robots Still Outnumber Humans in Web Archives, But Less Than Before

  • 1.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before Himarsha R. Jayanetti1 , Kritika Garg1 , Sawood Alam2 , Michael L. Nelson1 , and Michele C. Weigle1 1 Web Science & Digital Libraries Research Group Old Dominion University, Norfolk VA, USA @WebSciDL 2 Wayback Machine, Internet Archive San Francisco, California, USA @internetarchive Presented By: Himarsha R. Jayanetti Department of Computer Science Old Dominion University, Norfolk, Virginia @HimarshaJ @WebSciDL @oducs TPDL ‘22, The 26th International Conference on Theory and Practice of Digital Libraries, Padua, Italy, 20 - 23 September 2022
  • 2.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL A Scenario in Which a Human Accesses Web Archives 2 https://web.archive.org/web/20120313134227/http://www.li b.odu.edu/exhibits/odu75thanniversary/norfolkdivision.htm https://en.wikipedia.org/wiki/Old_Dominion_University#References
  • 3.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL Another Scenario Is When Bot Services Query Web Archives 3 TimeMap Visualization Tool (TMVis) https://github.com /oduwsdl/tmvis
  • 4.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL 4 Slider GIF TMViz Visualizes How Individual Webpages Have Changed Over Time https://web.archive.org/web/2022000 0000000*/http://4genderjustice.org/ TimeMap
  • 5.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL Our Study Is an Extension of a Previous Study Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Indianapolis, IN, July 2013, pp. 339-–348. https://doi.org/10.1145/2467696.2467722 5 ● Robots outnumber humans: ○ 10:1 (sessions) ○ 5:4 (raw HTTP accesses) ○ 4:1 (megabytes transferred) ● Robots almost always access TimeMaps, but humans access the mementos. ● No overall preference for mementos of a particular time, but the recent past (within the last year) shows significant repeat accesses. ● Proposed access patterns of web archive users: ○ Dip ○ Slide ○ Dive ○ Skim
  • 6.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL 6 Access Patterns of Web Archive Users: Dip and Dive Original Resource (URI-R1 ) URI-R2 URI-R3
  • 7.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL 7 Access Patterns of Web Archive Users: Slide and Skim
  • 8.
    Robots Still OutnumberHumans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL We Used Three Full Day Access Log Datasets Full day sample of access logs: ● IA 2012 Internet Archive - 2012 (February 2, 2012) ● IA 2019 Internet Archive - 2019 (February 7, 2019) ● PT 2019 Arquivo.pt - 2019 (February 7, 2019) Feature IA 2012 IA 2019 PT 2019 No. of Requests 99,173,542 308,194,916 1,046,855 GET 98.80% 98.68% 97.92% HEAD 1.12% 0.84% 1.37% Status Code 2xx 32.73% 48.26% 26.03% Status Code 3xx 52.57% 42.74% 20.22% Status Code 4xx 11.71% 8.79% 53.58% Status Code 5xx 2.99% 0.20% 0.17% Embedded Resources 43.62% 63.36% 19.68% SI Bot 0.01% 0.15% 0.34% 8 https://web.archive.org/ https://arquivo.pt/
  • 14.
    We Followed a Two-Step Data Cleaning Process
    Stage 1
    ● Remove log entries that were either invalid or irrelevant to the analysis.
      ○ Removed everything except requests to Mementos and TimeMaps.
      ○ Kept the requests to the robots.txt of the web archive.
    Stage 2
    ● Remove log entries that were irrelevant in terms of user behavior.
      ○ Everything except GET requests
      ○ Everything except 200, 404, and 503 response codes
      ○ Embedded resources

    Dataset    Before (No. of Requests)    After Cleaning (% of Requests)
                                           Stage 1      Stage 2
    IA 2012    99,173,542                  85.22%       18.58%
    IA 2019    308,194,916                 77.19%       11.36%
    PT 2019    1,046,855                   86.40%       57.77%
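The Stage 2 filter can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regular expression is written to fit the sample log lines shown in this deck, and the `EMBEDDED_EXT` list and the names `parse` and `keep_stage2` are assumptions made for the example.

```python
import re

# Pattern fitted to the sample log lines in this deck (an assumption, not the
# archives' documented log format). The size field is optional.
LOG_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3})(?: (?P<size>\S+))?'
)

KEPT_STATUS = {"200", "404", "503"}

# Hypothetical list of extensions treated as embedded resources.
EMBEDDED_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")


def parse(line):
    """Return the parsed fields of one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def keep_stage2(entry):
    """Keep only GET requests with status 200/404/503 that are not embedded resources."""
    if entry["method"] != "GET":
        return False
    if entry["status"] not in KEPT_STATUS:
        return False
    path = entry["url"].split("?", 1)[0].lower()
    return not path.endswith(EMBEDDED_EXT)
```

Classifying embedded resources by file extension is the simplest option; a real pipeline might rely on the response content type instead.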
  • 15.
    We Divided the Access Logs Into Different Sessions
    ○ Grouped the requests based on the IP and User-Agent.
    ○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes).

    1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

    The IP addresses are anonymized.
  • 16.
    We Divided the Access Logs Into Different Sessions
    ○ Grouped the requests based on the IP and User-Agent.
    ○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes).

    0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

    The duration between the two requests > 10 minutes, so the second request starts a new session.
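The session split described on this slide can be illustrated with a short sketch. It assumes requests are already reduced to (IP, User-Agent, timestamp) tuples; the function name `split_sessions` and the input shape are hypothetical, and only the 10-minute timeout comes from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)      # timeout threshold from the slide
TIME_FMT = "%d/%b/%Y:%H:%M:%S %z"    # e.g. 02/Feb/2012:04:36:43 +0000


def split_sessions(requests, timeout=TIMEOUT):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    A "user" is an (ip, user_agent) pair. A new session starts whenever the
    gap between two consecutive requests of the same user exceeds the timeout.
    Returns a list of (user, [datetime, ...]) pairs.
    """
    by_user = defaultdict(list)
    for ip, user_agent, stamp in requests:
        by_user[(ip, user_agent)].append(datetime.strptime(stamp, TIME_FMT))

    sessions = []
    for user, times in by_user.items():
        times.sort()
        current = [times[0]]
        for moment in times[1:]:
            if moment - current[-1] > timeout:
                sessions.append((user, current))
                current = []
            current.append(moment)
        sessions.append((user, current))
    return sessions
```

Applied to the six timestamps of the example log, this yields one session of two requests and one of four, matching the `_0_1` and `_0_2` suffixes in the sample.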
  • 17.
    Heuristic: The Type of Request (HEAD Requests)
    ● Web browsers issue GET requests for web pages.
    ● We flagged the clients making HEAD requests as bots.

    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
  • 18.
    Heuristic: List of Known Bots
    ● A manually compiled User-Agent list of known bots.
    ● User-Agents with keywords such as bot, crawler, spider, etc.
    ● The Python module "DeviceDetector", a User-Agent parser that helps determine whether or not a User-Agent is a bot.

    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

    https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
    https://github.com/oduwsdl/access-patterns/blob/main/Known_Bot_List/knownbot.tsv (Final Known Bot List)
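The keyword part of this heuristic might look like the sketch below. It covers only the keyword match and an optional exact-match list; the DeviceDetector step is left out, and the names `BOT_KEYWORDS` and `looks_like_bot` are illustrative assumptions rather than the study's code.

```python
import re

# Keywords named on the slide; the slide's "etc." implies the real list is longer.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def looks_like_bot(user_agent, known_bots=frozenset()):
    """Flag a User-Agent that is on a known-bot list or contains a bot keyword."""
    return user_agent in known_bots or bool(BOT_KEYWORDS.search(user_agent))
```

For example, `looks_like_bot("Twitterbot/1.0")` is true because of the `bot` keyword, while a self-describing agent without any keyword is only caught if it appears in `known_bots`.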
  • 19.
    Heuristic: Number of User-Agents per IP
    ● Some bots keep changing their User-Agent between requests to avoid being detected as a bot.
    ● We flagged requests from IPs that changed their User-Agent field more than 20 times as bots.

    x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0 "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" 00101000
    x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 00101000
    x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" 00101000
    x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
    . . .
    x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)" 00101000
    x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
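A rough sketch of the User-Agents-per-IP check follows. As a simplifying assumption it counts distinct User-Agent strings per IP as a proxy for how often the field changes; the threshold of 20 is from the slide, while `ips_with_many_user_agents` and its input shape are hypothetical.

```python
from collections import defaultdict

UA_THRESHOLD = 20  # "more than 20 times", per the slide


def ips_with_many_user_agents(requests, threshold=UA_THRESHOLD):
    """Return the set of IPs seen with more than `threshold` distinct User-Agents.

    `requests` is an iterable of (ip, user_agent) pairs. Counting distinct
    values is an approximation of counting User-Agent changes.
    """
    agents_by_ip = defaultdict(set)
    for ip, user_agent in requests:
        agents_by_ip[ip].add(user_agent)
    return {ip for ip, agents in agents_by_ip.items() if len(agents) > threshold}
```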
  • 20.
    Heuristic: Requests to robots.txt File
    ● Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
    ● We considered a request for the robots.txt file as an indication of a bot.

    0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385 "http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416 "http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
  • 21.
    Heuristic: Image to HTML ratio
    ● The image-to-HTML ratio is the ratio between the number of image files and the number of HTML files per session.
    ● Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
    ● We flagged a session with less than one image file for every 10 HTML files as a robot session.

    [Figure: http://web.archive.org/web/20220512060725/https://www.odu.edu/ shown twice, once downloaded using cURL and once accessed in the web browser]
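The threshold on this slide reduces to a one-line comparison. The sketch below assumes the image and HTML counts per session have already been computed; `is_robot_by_image_ratio` is an illustrative name, and leaving sessions with no HTML requests unflagged is an assumption, since the slide does not cover that case.

```python
def is_robot_by_image_ratio(n_images, n_html):
    """Flag a session with fewer than one image per 10 HTML files.

    Written as an integer comparison (10 * images < html) to avoid
    floating-point division. A session with zero HTML files is never
    flagged here, which is an assumption of this sketch.
    """
    return 10 * n_images < n_html
```

A session with 10 HTML pages and no images is flagged, while one with 10 HTML pages and 1 image sits exactly at the 1:10 boundary and is not.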
  • 22.
    Heuristic: Browsing Speed
    ● We considered a browsing speed >= 0.5 (requests per second) as a threshold to detect robot sessions.

    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
    0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302
    . . .
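One way to compute the browsing speed of a session is sketched below. Only the 0.5 requests-per-second threshold comes from the slide. Two choices are assumptions of this sketch: the session duration is clamped to at least one second (the log timestamps have one-second resolution), and sessions with a single request are never flagged. The function names are hypothetical.

```python
from datetime import datetime, timedelta

SPEED_THRESHOLD = 0.5  # requests per second, from the slide


def browsing_speed(timestamps):
    """Requests per second over a session given as a list of datetimes."""
    if len(timestamps) < 2:
        return 0.0  # assumption: a single request has no measurable speed
    duration = (max(timestamps) - min(timestamps)).total_seconds()
    return len(timestamps) / max(duration, 1.0)


def is_robot_by_speed(timestamps, threshold=SPEED_THRESHOLD):
    """Flag a session whose browsing speed reaches the threshold."""
    return browsing_speed(timestamps) >= threshold
```

The ten sample requests above all share the timestamp 00:46:30, so under these assumptions the session's speed would be 10 requests per second, well above the threshold.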
  • 23.
    Results of Applying the Heuristics Separately to Detect Bots

    Heuristics             IA 2012                   IA 2019                   PT 2019
                           Sessions     Requests     Sessions     Requests     Sessions     Requests
                           (1.53M)      (22.3M)      (2.7M)       (42.9M)      (3.7k)       (614k)
    Known Bots             1.0%         1.0%         12.0%        12.0%        24.0%        11.0%
    #UA per IP             0.3%         3.0%         0.2%         3.4%         0.1%         0.4%
    Robots.txt             0.1%         0.1%         0.4%         0.1%         11.0%        0.7%
    Image to HTML ratio    87.0%        89.0%        66.0%        56.0%        79.0%        96.0%
    Browsing Speed         16.0%        20.0%        19.0%        49.0%        46.0%        26.0%
    Total Robots           88.0%        91.0%        70.0%        70.0%        97.0%        98.0%

    The percentage of sessions/requests labeled as robots by each heuristic separately.
  • 24.
    Image-to-HTML Ratio Had the Largest Effect on Detecting Robots

    Heuristics             IA 2012                   IA 2019                   PT 2019
                           Sessions     Requests     Sessions     Requests     Sessions     Requests
    Image to HTML ratio    87.0%        89.0%        66.0%        56.0%        79.0%        96.0%
  • 25.
    Total Number of Detected Bots After Applying All the Heuristics Together

    Heuristics             IA 2012                   IA 2019                   PT 2019
                           Sessions     Requests     Sessions     Requests     Sessions     Requests
    Total Robots           88.0%        91.0%        70.0%        70.0%        97.0%        98.0%

    The percentage of sessions/requests detected as robots after applying all the heuristics together.
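How the individual heuristics are combined is not spelled out on the slide. The sketch below assumes a simple union, where a session counts as a robot if any heuristic flags it; this is consistent with the "Total Robots" row being at least as large as every individual row, but it remains an assumption. The function name and input shape are hypothetical.

```python
def total_robot_share(flags_per_session):
    """Percentage of sessions flagged by at least one heuristic.

    `flags_per_session` is a list with one dict per session, mapping a
    heuristic name to True/False. The union rule is an assumption.
    """
    if not flags_per_session:
        return 0.0
    robots = sum(1 for flags in flags_per_session if any(flags.values()))
    return 100.0 * robots / len(flags_per_session)
```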
  • 26.
    Potential Reasons for the Increase in Human Sessions in 2019 Compared to 2012

    Heuristics             IA 2012                   IA 2019
                           Sessions     Requests     Sessions     Requests
    Total Robots           88.0%        91.0%        70.0%        70.0%

    ● Increase in awareness of web archives among human users in recent years.
  • 27.
    Potential Reasons for the Increase in Human Sessions in 2019 Compared to 2012

    Heuristics             IA 2012                   IA 2019
                           Sessions     Requests     Sessions     Requests
    Total Robots           88.0%        91.0%        70.0%        70.0%

    ● Increase in popularity of headless browser setups using Headless Chromium, PhantomJS, Selenium, and Puppeteer in recent years. Such robots load pages like a regular browser, so they may be counted as human sessions.
  • 28.
    In IA2012, the Robots Were Almost Exclusively Limited To Dip and Skim
  • 29.
    The Majority of Requests Are for Mementos Around the Time Each Access Log Sample Was Taken.
  • 30.
    Key Takeaways

    Dataset/Feature    AlNoamany et al.    Our Study
    Sample Duration    30 minutes          24 hours
    Web Archives       Internet Archive    Internet Archive & Arquivo.pt
    Access Log Year    2012                2012 & 2019

    ● The percentage of web archive accesses that were detected as robots:
      ○ Out of requests: IA 2012: 91%, IA 2019: 70%, PT 2019: 98%
      ○ Out of sessions: IA 2012: 88%, IA 2019: 70%, PT 2019: 97%
    ● In IA2012, the robots were almost exclusively limited to Dip and Skim, but in IA2019, they exhibit all of the patterns and their combinations.
    ● The majority of the requests are for mementos that are close to the datetime of each log sample.

    Himarsha R. Jayanetti
    hjaya002@odu.edu
    @HimarshaJ
  • 31.
    Backup slides …
  • 32.
    An Overview of Our Approach and the Steps Followed