Robots Still Outnumber Humans in Web
Archives, But Less Than Before
Himarsha R. Jayanetti (1), Kritika Garg (1), Sawood Alam (2), Michael L. Nelson (1), and Michele C. Weigle (1)
(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk VA, USA, @WebSciDL
(2) Wayback Machine, Internet Archive, San Francisco, California, USA, @internetarchive
Presented By:
Himarsha R. Jayanetti
Department of Computer Science
Old Dominion University, Norfolk, Virginia
@HimarshaJ @WebSciDL @oducs
TPDL ‘22, The 26th International Conference on Theory and Practice of Digital Libraries, Padua, Italy, 20 - 23 September 2022
Robots Still Outnumber Humans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL
A Scenario in Which a Human Accesses Web Archives
https://web.archive.org/web/20120313134227/http://www.lib.odu.edu/exhibits/odu75thanniversary/norfolkdivision.htm
https://en.wikipedia.org/wiki/Old_Dominion_University#References
Another Scenario Is When Bot Services Query Web Archives
TimeMap Visualization Tool (TMVis): https://github.com/oduwsdl/tmvis
TMVis Visualizes How Individual Webpages Have Changed Over Time
TimeMap: https://web.archive.org/web/20220000000000*/http://4genderjustice.org/
[Figure: TMVis views labeled Slider and GIF]
Our Study Is an Extension of a Previous Study
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Indianapolis, IN, July 2013, pp. 339-348. https://doi.org/10.1145/2467696.2467722
● Robots outnumber humans:
○ 10:1 (sessions)
○ 5:4 (raw HTTP accesses)
○ 4:1 (megabytes transferred)
● Robots almost always access TimeMaps, but humans access
the mementos.
● No overall preference for mementos of a particular time, but
the recent past (within the last year) shows significant repeat
accesses.
● Proposed access patterns of web archive users:
○ Dip
○ Slide
○ Dive
○ Skim
Access Patterns of Web Archive Users: Dip and Dive
[Figure: Dip and Dive access patterns, with labels Original Resource (URI-R1), URI-R2, and URI-R3]
Access Patterns of Web Archive Users: Slide and Skim
We Used Three Full Day Access Log Datasets
Full day sample of access logs:
● IA 2012: Internet Archive (February 2, 2012)
● IA 2019: Internet Archive (February 7, 2019)
● PT 2019: Arquivo.pt (February 7, 2019)

Feature              IA 2012       IA 2019        PT 2019
No. of Requests      99,173,542    308,194,916    1,046,855
GET                  98.80%        98.68%         97.92%
HEAD                 1.12%         0.84%          1.37%
Status Code 2xx      32.73%        48.26%         26.03%
Status Code 3xx      52.57%        42.74%         20.22%
Status Code 4xx      11.71%        8.79%          53.58%
Status Code 5xx      2.99%         0.20%          0.17%
Embedded Resources   43.62%        63.36%         19.68%
SI Bot               0.01%         0.15%          0.34%
https://web.archive.org/
https://arquivo.pt/
We Followed a Two-Step Data Cleaning Process
Stage 1
● Remove log entries that were either invalid or irrelevant to the analysis.
○ Removed everything except requests to mementos and TimeMaps
○ Kept the requests to the robots.txt of the web archive.
Dataset    Before (No. of Requests)    After Cleaning: Stage 1    After Cleaning: Stage 2
IA 2012    99,173,542                  85.22%                     18.58%
IA 2019    308,194,916                 77.19%                     11.36%
PT 2019    1,046,855                   86.40%                     57.77%
Stage 2
● Remove log entries that were irrelevant in
terms of user behavior.
○ Everything except GET requests
○ Everything except 200, 404, and 503
response codes
○ Embedded resources
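The Stage 2 rules can be sketched as a single predicate over parsed log entries. This is a minimal illustration, not the authors' code: the `LogEntry` class, its field names, and the precomputed `is_embedded` flag are all hypothetical, and how embedded resources are identified is not specified on the slide.

```python
from dataclasses import dataclass

# Status codes kept in Stage 2, as listed on the slide.
KEPT_STATUS_CODES = {200, 404, 503}


@dataclass
class LogEntry:
    """Hypothetical parsed log entry; field names are illustrative."""
    method: str
    status: int
    is_embedded: bool  # assumed to be determined elsewhere


def keep_in_stage2(entry: LogEntry) -> bool:
    """Return True if the entry survives the Stage 2 cleaning rules."""
    return (
        entry.method == "GET"
        and entry.status in KEPT_STATUS_CODES
        and not entry.is_embedded
    )


entries = [
    LogEntry("GET", 200, False),   # kept
    LogEntry("HEAD", 200, False),  # removed: not a GET request
    LogEntry("GET", 302, False),   # removed: status not in {200, 404, 503}
    LogEntry("GET", 200, True),    # removed: embedded resource
]
cleaned = [e for e in entries if keep_in_stage2(e)]
```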
We Divided the Access Logs Into Different Sessions
○ Grouped the requests based on the IP and User-Agent.
○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes)
1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
The IP addresses are anonymized.
The duration between the requests at 04:36:44 and 04:58:34 is greater than 10 minutes, so they fall into different sessions.
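The session splitting described on this slide can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, and it assumes that a gap strictly greater than 10 minutes starts a new session. The timestamps are taken from the log excerpt on this slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)


def split_sessions(requests):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    Requests from the same (ip, user_agent) pair stay in one session until
    the gap between consecutive requests exceeds SESSION_TIMEOUT.
    """
    by_user = defaultdict(list)
    for ip, user_agent, ts in requests:
        by_user[(ip, user_agent)].append(ts)

    sessions = []
    for user, timestamps in by_user.items():
        timestamps.sort()
        current = [timestamps[0]]
        for ts in timestamps[1:]:
            if ts - current[-1] > SESSION_TIMEOUT:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions


def parse_ts(text):
    # Timestamp layout used in the log excerpt, e.g. 02/Feb/2012:04:36:43 +0000
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
sample = [
    ("0.0.0.100", ua, parse_ts("02/Feb/2012:04:36:44 +0000")),
    ("0.0.0.100", ua, parse_ts("02/Feb/2012:04:58:34 +0000")),
    ("0.0.0.100", ua, parse_ts("02/Feb/2012:04:58:37 +0000")),
    ("0.0.0.100", ua, parse_ts("02/Feb/2012:05:00:21 +0000")),
]
sessions = split_sessions(sample)
```

With these four requests, the first one ends up alone in its session and the remaining three share a second session.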
Heuristic: The Type of Request (HEAD Requests)
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
● Web browsers issue GET requests for web pages.
● We flagged HEAD requests as bot requests.
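A small sketch of this check, assuming the request method is read from the quoted request line of each log entry. The regular expression and the function name are illustrative, not taken from the study's code; the two sample lines come from the excerpt above.

```python
import re

# Matches the quoted request line of a log entry, e.g. "HEAD /path HTTP/1.1".
REQUEST_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def is_head_request(log_line: str) -> bool:
    """Return True if the log line records a HEAD request."""
    match = REQUEST_LINE.search(log_line)
    return match is not None and match.group("method") == "HEAD"


head_line = (
    '199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] '
    '"HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" '
    '200 - "-" "Twitterbot/1.0" 11001100'
)
get_line = (
    '199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] '
    '"GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100'
)
```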
Heuristic: List of Known Bots
● A manually compiled User-Agent list of known bots.
● User-Agents with keywords such as bot, crawler, spider, etc.
● The Python module "DeviceDetector", a User-Agent parser that helps determine whether or not a User-Agent is a bot.
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
https://github.com/oduwsdl/access-patterns/blob/main/Known_Bot_List/knownbot.tsv (Final Known Bot List)
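The keyword and list checks can be sketched with the standard library alone. This is a simplified illustration: the keyword pattern covers only the three keywords named on the slide, `KNOWN_BOT_AGENTS` is a hypothetical stand-in for the manually compiled list, and the DeviceDetector step is left out to keep the example dependency-free.

```python
import re

# Keywords named on the slide; the "etc." there means this list is incomplete.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Hypothetical stand-in for the manually compiled known-bot list.
KNOWN_BOT_AGENTS = {"ExampleFetcher/1.0"}


def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent is listed or contains a bot keyword."""
    return user_agent in KNOWN_BOT_AGENTS or bool(BOT_KEYWORDS.search(user_agent))


browser_ua = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) "
    "Gecko/20100625 Firefox/3.6.6"
)
```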
Heuristic: Number of User-Agents per IP
x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
00101000
x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
. . .
. . .
. . .
x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)"
00101000
x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
00101000
x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
● Some of the bots keep changing their User-Agent between requests to avoid being detected as a bot.
● We have flagged requests from IPs that update their User-Agent field more than 20 times as bots.
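One way to sketch this heuristic is to count how often the User-Agent changes between consecutive requests from the same IP. This is an assumed interpretation of "update their User-Agent field more than 20 times"; counting distinct User-Agents per IP would be another plausible reading. The function name and the generated sample data are hypothetical.

```python
from collections import defaultdict

MAX_UA_CHANGES = 20


def ips_with_rotating_user_agents(requests, max_changes=MAX_UA_CHANGES):
    """Return the IPs whose User-Agent changes more than max_changes times.

    `requests` is an iterable of (ip, user_agent) pairs in chronological order.
    """
    last_seen = {}
    changes = defaultdict(int)
    for ip, user_agent in requests:
        if ip in last_seen and last_seen[ip] != user_agent:
            changes[ip] += 1
        last_seen[ip] = user_agent
    return {ip for ip, count in changes.items() if count > max_changes}


# 22 requests with 22 different User-Agents: 21 changes, above the threshold.
rotating = [("x0.77.87.100", f"agent-{i}") for i in range(22)]
# 50 requests with one User-Agent: no changes.
steady = [("0.0.0.100", "Firefox/3.6.6")] * 50
flagged = ips_with_rotating_user_agents(rotating + steady)
```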
Heuristic: Requests to robots.txt File
● Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
● We considered a request for the robots.txt file as an indication for a bot request.
0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout
0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385
"http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416
"http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2"
00001000
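A minimal sketch of the robots.txt check, assuming the request target has already been extracted from the log line. It compares only the path of the target, so it matches the archive's own robots.txt in both the absolute and the relative form seen in the excerpts, but not an archived copy of some other site's robots.txt. The function name is illustrative.

```python
from urllib.parse import urlsplit


def is_robots_txt_request(target: str) -> bool:
    """Return True if the request target is the archive's own /robots.txt."""
    return urlsplit(target).path == "/robots.txt"


targets = [
    "http://web.archive.org/robots.txt",
    "/robots.txt",
    "http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI",
]
robots_hits = [t for t in targets if is_robots_txt_request(t)]
```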
Heuristic: Image to HTML ratio
● The image-to-HTML ratio is the number of image files divided by the number of HTML files requested in a session.
● Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
● We flagged a session with less than one image file for every 10 HTML files as a robot session.
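The threshold can be written as an integer comparison, which avoids floating-point edge cases at exactly 1:10. This is a sketch under assumptions: the per-session image and HTML counts are taken as given, and a session without any HTML requests is left unflagged because the ratio is undefined there (the slide does not cover that case).

```python
def is_robot_by_image_ratio(num_images: int, num_html: int) -> bool:
    """Flag a session with fewer than one image per 10 HTML files."""
    if num_html == 0:
        # Assumption: the ratio is undefined without HTML requests.
        return False
    # Same test as num_images / num_html < 1/10, without division.
    return num_images * 10 < num_html
```

For example, a session with 1 image and 11 HTML files is flagged, while a session with 1 image and 10 HTML files is not.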
[Figure: http://web.archive.org/web/20220512060725/https://www.odu.edu/ as downloaded using cURL and as accessed in the web browser]
Heuristic: Browsing Speed
● We considered a browsing speed >= 0.5 (requests per second) as a threshold to detect robot sessions.
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1"
200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1"
200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET
/web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302
. . .
. . .
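A sketch of the speed check, assuming browsing speed is the number of requests in a session divided by the time between its first and last request. That definition, the handling of sessions that fit within one second, and the function names are assumptions; the slide gives only the 0.5 requests-per-second threshold.

```python
from datetime import datetime, timedelta

SPEED_THRESHOLD = 0.5  # requests per second


def browsing_speed(timestamps):
    """Requests per second over the span of a session (assumed definition)."""
    duration = (max(timestamps) - min(timestamps)).total_seconds()
    if duration == 0:
        # Assumption: several requests within one second count as infinitely
        # fast; a single request has no measurable speed.
        return float("inf") if len(timestamps) > 1 else 0.0
    return len(timestamps) / duration


def is_robot_by_speed(timestamps):
    """Flag a session whose browsing speed reaches the threshold."""
    return browsing_speed(timestamps) >= SPEED_THRESHOLD


t0 = datetime(2019, 2, 7, 0, 46, 30)
burst = [t0] * 10                                      # ten requests in one second
slow = [t0 + timedelta(seconds=30 * i) for i in range(3)]  # three requests in a minute
```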
Results of Applying the Heuristics Separately to Detect Bots
Heuristics            IA 2012                IA 2019                PT 2019
                      Sessions    Requests   Sessions    Requests   Sessions    Requests
                      (1.53M)     (22.3M)    (2.7M)      (42.9M)    (3.7k)      (614k)
Known Bots            1.0%        1.0%       12.0%       12.0%      24.0%       11.0%
#UA per IP            0.3%        3.0%       0.2%        3.4%       0.1%        0.4%
Robots.txt            0.1%        0.1%       0.4%        0.1%       11.0%       0.7%
Image to HTML ratio   87.0%       89.0%      66.0%       56.0%      79.0%       96.0%
Browsing Speed        16.0%       20.0%      19.0%       49.0%      46.0%       26.0%
Total Robots          88.0%       91.0%      70.0%       70.0%      97.0%       98.0%

The percentage of sessions/requests labeled as robots by each heuristic separately.
Image-to-HTML Ratio Had the Largest Effect on
Detecting Robots
Total Number of Detected Bots After Applying All the
Heuristics Together
The number of requests/sessions detected after applying all the heuristics together.
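Applying the heuristics together can be sketched as a union of the individual flags. This is an assumption about how the combined label is formed, not a confirmed description of the study's method; it is at least consistent with the Total Robots row never being smaller than any single heuristic's row. The flag names and the sample sessions are hypothetical.

```python
def is_robot_session(flags: dict) -> bool:
    """A session counts as a robot if any heuristic flagged it (assumed rule)."""
    return any(flags.values())


def robot_share(sessions) -> float:
    """Percentage of sessions labeled as robots."""
    robots = sum(1 for flags in sessions if is_robot_session(flags))
    return 100.0 * robots / len(sessions)


sample_sessions = [
    {"known_bot": True, "image_ratio": True, "speed": False},
    {"known_bot": False, "image_ratio": True, "speed": False},
    {"known_bot": False, "image_ratio": False, "speed": True},
    {"known_bot": False, "image_ratio": False, "speed": False},
]
share = robot_share(sample_sessions)
```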
Potential Reasons for the Increase in Human Sessions in 2019 Compared to 2012
Increase in awareness of web archives among human users in recent years
Increase in popularity of headless browsers and automation tools such as Headless Chromium, PhantomJS, Selenium, and Puppeteer in recent years
In IA2012, the Robots Were Almost
Exclusively Limited To Dip and Skim
The Majority of Requests Are for Mementos Around
the Time Each Access Log Sample Was Taken.
Key Takeaways
● The percentage of web archive accesses that were detected as robots:
○ Out of requests: IA 2012: 91%, IA 2019: 70%, PT 2019: 98%
○ Out of sessions: IA 2012: 88%, IA 2019: 70%, PT 2019: 97%
● In IA2012, the robots were almost exclusively limited to Dip and Skim, but in IA2019, they exhibit all of the patterns and their combinations.
● The majority of the requests are for mementos that are close to the datetime of each log sample.

Dataset/Feature    AlNoamany et al.    Our Study
Sample Duration    30 minutes          24 hrs
Web Archives       Internet Archive    Internet Archive & Arquivo.pt
Access Log Year    2012                2012 & 2019
Himarsha R. Jayanetti
hjaya002@odu.edu
@HimarshaJ
Backup slides …
An Overview of Our Approach and the Steps Followed
IRJET - Food Supply Chain Management using Blockchain in Food Traceability
 
Treasure Data Cloud Data Platform
Treasure Data Cloud Data PlatformTreasure Data Cloud Data Platform
Treasure Data Cloud Data Platform
 
IRJET -Securing Data in Distributed System using Blockchain and AI
IRJET -Securing Data in Distributed System using Blockchain and AIIRJET -Securing Data in Distributed System using Blockchain and AI
IRJET -Securing Data in Distributed System using Blockchain and AI
 
Science Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related servicesScience Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related services
 
INTERFACE, by apidays - The Evolution of Data Movement.pdf
INTERFACE, by apidays - The Evolution of Data Movement.pdfINTERFACE, by apidays - The Evolution of Data Movement.pdf
INTERFACE, by apidays - The Evolution of Data Movement.pdf
 
The road to monitoring Nirvana
The road to monitoring NirvanaThe road to monitoring Nirvana
The road to monitoring Nirvana
 
Web2.0!
Web2.0!Web2.0!
Web2.0!
 
Primers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code ReviewPrimers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code Review
 

Recently uploaded

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 

Recently uploaded (20)

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 

Robots Still Outnumber Humans in Web Archives, But Less Than Before

  • 1. Robots Still Outnumber Humans in Web Archives, But Less Than Before
    Himarsha R. Jayanetti (1), Kritika Garg (1), Sawood Alam (2), Michael L. Nelson (1), and Michele C. Weigle (1)
    (1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, VA, USA (@WebSciDL)
    (2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
    Presented by: Himarsha R. Jayanetti, Department of Computer Science, Old Dominion University, Norfolk, Virginia (@HimarshaJ @WebSciDL @oducs)
    TPDL ’22, The 26th International Conference on Theory and Practice of Digital Libraries, Padua, Italy, 20-23 September 2022
  • 2. A Scenario in Which a Human Accesses Web Archives
    https://web.archive.org/web/20120313134227/http://www.lib.odu.edu/exhibits/odu75thanniversary/norfolkdivision.htm
    https://en.wikipedia.org/wiki/Old_Dominion_University#References
  • 3. Another Scenario Is When Bot Services Query Web Archives
    TimeMap Visualization Tool (TMVis): https://github.com/oduwsdl/tmvis
  • 4. TMVis Visualizes How Individual Webpages Have Changed Over Time
    [Figure labels: TimeMap, Slider, GIF]
    https://web.archive.org/web/20220000000000*/http://4genderjustice.org/
  • 5. Our Study Is an Extension of a Previous Study
    Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, July 2013, pp. 339–348. https://doi.org/10.1145/2467696.2467722
    ● Robots outnumber humans:
      ○ 10:1 (sessions)
      ○ 5:4 (raw HTTP accesses)
      ○ 4:1 (megabytes transferred)
    ● Robots almost always access TimeMaps, but humans access the mementos.
    ● No overall preference for mementos of a particular time, but the recent past (within the last year) shows significant repeat accesses.
    ● Proposed access patterns of web archive users:
      ○ Dip
      ○ Slide
      ○ Dive
      ○ Skim
  • 6. Access Patterns of Web Archive Users: Dip and Dive
    [Figure labels: Original Resource (URI-R1), URI-R2, URI-R3]
  • 7. Access Patterns of Web Archive Users: Slide and Skim
  • 8. We Used Three Full Day Access Log Datasets
    Full-day samples of access logs:
    ● IA 2012: Internet Archive (February 2, 2012), https://web.archive.org/
    ● IA 2019: Internet Archive (February 7, 2019), https://web.archive.org/
    ● PT 2019: Arquivo.pt (February 7, 2019), https://arquivo.pt/

    | Feature            | IA 2012    | IA 2019     | PT 2019   |
    |--------------------|------------|-------------|-----------|
    | No. of Requests    | 99,173,542 | 308,194,916 | 1,046,855 |
    | GET                | 98.80%     | 98.68%      | 97.92%    |
    | HEAD               | 1.12%      | 0.84%       | 1.37%     |
    | Status Code 2xx    | 32.73%     | 48.26%      | 26.03%    |
    | Status Code 3xx    | 52.57%     | 42.74%      | 20.22%    |
    | Status Code 4xx    | 11.71%     | 8.79%       | 53.58%    |
    | Status Code 5xx    | 2.99%      | 0.20%       | 0.17%     |
    | Embedded Resources | 43.62%     | 63.36%      | 19.68%    |
    | SI Bot             | 0.01%      | 0.15%       | 0.34%     |
  • 14. We Followed a Two-Step Data Cleaning Process
    Stage 1
    ● Remove log entries that were either invalid or irrelevant to the analysis.
      ○ Removed everything except requests to mementos and requests to TimeMaps.
      ○ Kept the requests to the robots.txt of the web archive.
    Stage 2
    ● Remove log entries that were irrelevant in terms of user behavior.
      ○ Removed everything except GET requests.
      ○ Removed everything except 200, 404, and 503 response codes.
      ○ Removed embedded resources.

    | Dataset | Before (No. of Requests) | After Stage 1 (% of requests) | After Stage 2 (% of requests) |
    |---------|--------------------------|-------------------------------|-------------------------------|
    | IA 2012 | 99,173,542               | 85.22%                        | 18.58%                        |
    | IA 2019 | 308,194,916              | 77.19%                        | 11.36%                        |
    | PT 2019 | 1,046,855                | 86.40%                        | 57.77%                        |
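The Stage 2 rules can be read as a single predicate over each log entry. The sketch below is one possible reading, not the authors' code: the `(method, status, is_embedded)` record layout and the function name are hypothetical, and how an entry is recognized as an embedded resource is left outside the sketch.

```python
# Status codes the slide says were kept in Stage 2.
KEPT_STATUS = {200, 404, 503}

def keep_for_behavior_analysis(method: str, status: int, is_embedded: bool) -> bool:
    """Keep GET requests with status 200/404/503 that are not embedded resources."""
    return method == "GET" and status in KEPT_STATUS and not is_embedded

# Hypothetical records: (method, status, is_embedded)
records = [
    ("GET", 200, False),   # kept
    ("HEAD", 200, False),  # dropped: not a GET request
    ("GET", 302, False),   # dropped: status code not in the kept set
    ("GET", 200, True),    # dropped: embedded resource
    ("GET", 404, False),   # kept
]
cleaned = [r for r in records if keep_for_behavior_analysis(*r)]
print(len(cleaned), "of", len(records), "records kept")
```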
  • 15. We Divided the Access Logs Into Different Sessions
    ○ Grouped the requests based on the IP and User-Agent.
    ○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes).
    Log excerpt (the IP addresses are anonymized):
    1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
    0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
  • 16. We Divided the Access Logs Into Different Sessions (same log excerpt as slide 15)
    ○ Grouped the requests based on the IP and User-Agent.
    ○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes).
    ○ Highlighted in the excerpt: the duration between the requests at 04:36:44 and 04:58:34 is greater than 10 minutes, so the later requests belong to a new session.
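The sessionization described on slides 15 and 16 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sessionize` and the tuple layout are hypothetical, and all six timestamps from the excerpt are attributed to one illustrative (IP, User-Agent) pair.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)  # threshold stated on the slide

def sessionize(requests):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    Returns a list of ((ip, user_agent), [timestamps]) pairs, one per session.
    """
    by_user = defaultdict(list)
    for ip, user_agent, ts in requests:
        by_user[(ip, user_agent)].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for prev, ts in zip(stamps, stamps[1:]):
            if ts - prev > SESSION_TIMEOUT:
                # Gap exceeds the timeout: close the session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions

# Timestamps taken from the slide's log excerpt; IP and User-Agent are placeholders.
times = ["04:36:43", "04:36:44", "04:58:34", "04:58:37", "05:00:21", "05:00:49"]
requests = [
    ("0.0.0.100", "Firefox/3.6.6",
     datetime.strptime("2012-02-02 " + t, "%Y-%m-%d %H:%M:%S"))
    for t in times
]
sessions = sessionize(requests)
print([len(stamps) for _, stamps in sessions])
```

With these inputs the first two requests fall into one session and the remaining four into a second, matching the `_1` and `_2` suffixes in the excerpt.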
  • 17. Heuristic: The Type of Request (HEAD Requests)
    ● Web browsers issue GET requests for web pages.
    ● We flagged requests that use the HEAD method as bot requests.
    Log excerpt:
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
    199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
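As a rough illustration of the HEAD heuristic, the request method can be pulled out of the quoted request field of each log line. The regular expression and the function name below are assumptions made for this sketch; the talk does not show how the logs were parsed.

```python
import re

# Matches the quoted request field, e.g. "HEAD /wayback/... HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) (?P<proto>HTTP/[\d.]+)"')

def is_head_request(log_line: str) -> bool:
    """Return True when the first request field in the line uses the HEAD method."""
    match = REQUEST_RE.search(log_line)
    return match is not None and match.group("method") == "HEAD"

# Two lines copied from the slide's log excerpt.
get_line = ('199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] '
            '"GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100')
head_line = ('199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] '
             '"HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" '
             '200 - "-" "Twitterbot/1.0" 11001100')

print(is_head_request(get_line), is_head_request(head_line))
```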
  • 18. Heuristic: List of Known Bots
    ● A manually compiled User-Agent list of known bots.
    ● User-Agents containing keywords such as bot, crawler, spider, etc.
    ● The Python module DeviceDetector, a User-Agent parser, used to determine whether or not a User-Agent is a bot.
    Example: the "Twitterbot/1.0" User-Agent in the log excerpt on slide 17.
    https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
    https://github.com/oduwsdl/access-patterns/blob/main/Known_Bot_List/knownbot.tsv (Final Known Bot List)
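The keyword and known-list parts of this heuristic might look like the sketch below. It is only an approximation: the keyword pattern covers just the three keywords named on the slide, `KNOWN_BOT_AGENTS` is a hypothetical stand-in for the manually compiled list (the real one is in the linked knownbot.tsv), and the DeviceDetector step is not shown.

```python
import re

# Keywords named on the slide; the authors' full keyword set is not given there.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Hypothetical stand-in for the manually compiled known-bot list.
KNOWN_BOT_AGENTS = {"ExampleFetcher/1.0"}

def looks_like_known_bot(user_agent: str) -> bool:
    """Flag a User-Agent found in the known list or containing a bot keyword."""
    return user_agent in KNOWN_BOT_AGENTS or bool(BOT_KEYWORDS.search(user_agent))

firefox = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) "
           "Gecko/20100625 Firefox/3.6.6")
for agent in ("Twitterbot/1.0", "ExampleFetcher/1.0", firefox):
    print(agent[:20], looks_like_known_bot(agent))
```

Plain substring matching can misfire on User-Agents that merely contain "bot" inside another word, which is one reason to combine it with a curated list and a parser.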
  • 19. Heuristic: Number of User-Agents per IP
    ● Some of the bots keep changing their User-Agent between requests to avoid being detected as a bot.
    ● We flagged requests from IPs that update their User-Agent field more than 20 times as bots.
    Log excerpt:
    x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0 "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" 00101000
    x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 00101000
    x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" 00101000
    x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
    . . .
    x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)" 00101000
    x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
    x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
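One way to read the "more than 20 User-Agent updates" rule is to count how often the User-Agent changes between consecutive requests from the same IP. That reading is an assumption (the slide could also mean more than 20 distinct User-Agents), and the function name and input layout below are hypothetical.

```python
from collections import defaultdict

MAX_UA_CHANGES = 20  # threshold stated on the slide

def ips_with_rotating_user_agents(requests, max_changes=MAX_UA_CHANGES):
    """Return the IPs whose User-Agent changed more than max_changes times.

    requests: iterable of (ip, user_agent) pairs in chronological order.
    """
    last_ua = {}
    changes = defaultdict(int)
    for ip, user_agent in requests:
        if ip in last_ua and last_ua[ip] != user_agent:
            changes[ip] += 1
        last_ua[ip] = user_agent
    return {ip for ip, count in changes.items() if count > max_changes}

# Synthetic data: one IP cycles through 22 User-Agents (21 changes), one never changes.
rotating = [("x0.77.87.100", f"UA-{i}") for i in range(22)]
stable = [("0.0.0.100", "Firefox/3.6.6")] * 50
flagged = ips_with_rotating_user_agents(rotating + stable)
print(flagged)
```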
  • 20. Heuristic: Requests to robots.txt File
    ● Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
    ● We considered a request for the robots.txt file as an indication of a bot request.
    Log excerpt:
    0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385 "http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
    0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416 "http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2" 00001000
    0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
  • 21. Heuristic: Image to HTML ratio
    ● The Image-to-HTML ratio is the ratio between the number of image files and the number of HTML files per session.
    ● Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
    ● We flagged a session with less than one image file for every 10 HTML files as a robot session.
    [Figure: http://web.archive.org/web/20220512060725/https://www.odu.edu/ downloaded using cURL vs. accessed in the web browser]
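The 1:10 threshold can be checked with integer arithmetic, which avoids dividing by zero when a session has no HTML files. The sketch below assumes the image and HTML counts per session are already available; the function name is hypothetical and how files were classified as image or HTML is not covered.

```python
def is_robot_by_image_ratio(n_images: int, n_html: int) -> bool:
    """Flag a session with fewer than one image per 10 HTML files.

    n_images / n_html < 1 / 10 is rewritten as n_images * 10 < n_html.
    """
    return n_images * 10 < n_html

# Hypothetical sessions: (images, html files)
examples = [(0, 10), (1, 10), (1, 11), (30, 5), (0, 0)]
for images, html in examples:
    print(images, html, is_robot_by_image_ratio(images, html))
```

Under this reading a session with a single HTML request and no images is flagged, and a session with no HTML files at all is never flagged.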
  • 22. Heuristic: Browsing Speed
    ● We used a browsing speed of >= 0.5 requests per second as the threshold for detecting robot sessions.
    Example access log entries, all issued within the same second:
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
      0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302
      ...
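The browsing-speed threshold on slide 22 can be sketched as below. The slide gives only the threshold (>= 0.5 requests per second), so the rest is assumed for illustration: the helper names `parse_time`, `browsing_speed`, and `is_robot_by_speed` are hypothetical, speed is taken to be the request count divided by the time between the first and last request, a burst inside a single second is clamped to a one-second duration, and a session with fewer than two requests is given a speed of zero.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp, e.g. [07/Feb/2019:00:46:30 +0000]
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def parse_time(log_line):
    """Extract the request time from one access log entry."""
    match = TIMESTAMP_RE.search(log_line)
    if match is None:
        raise ValueError("no timestamp found in log line")
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def browsing_speed(log_lines):
    """Requests per second over a session (see assumptions above)."""
    if len(log_lines) < 2:
        return 0.0  # assumption: no measurable speed for a lone request
    times = sorted(parse_time(line) for line in log_lines)
    duration = (times[-1] - times[0]).total_seconds()
    # Assumption: clamp to one second so same-second bursts stay finite.
    return len(times) / max(duration, 1.0)


def is_robot_by_speed(log_lines, threshold=0.5):
    """Flag a session whose browsing speed meets the threshold."""
    return browsing_speed(log_lines) >= threshold


# Ten copies of one entry from the slide, all within the same second.
entry = (
    '0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] '
    '"GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js '
    'HTTP/1.1" 200'
)
burst_session = [entry] * 10
```

Under these assumptions, ten requests within one second give a speed well above the threshold, so `is_robot_by_speed(burst_session)` flags the session.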
  • 23. Results of Applying the Heuristics Separately to Detect Bots
    Percentage of sessions and requests labeled as robots by each heuristic applied separately (dataset totals in parentheses).
    Heuristic              IA 2012               IA 2019               PT 2019
                           Sessions   Requests   Sessions   Requests   Sessions   Requests
                           (1.53M)    (22.3M)    (2.7M)     (42.9M)    (3.7k)     (614k)
    Known Bots             1.0%       1.0%       12.0%      12.0%      24.0%      11.0%
    #UA per IP             0.3%       3.0%       0.2%       3.4%       0.1%       0.4%
    Robots.txt             0.1%       0.1%       0.4%       0.1%       11.0%      0.7%
    Image to HTML ratio    87.0%      89.0%      66.0%      56.0%      79.0%      96.0%
    Browsing Speed         16.0%      20.0%      19.0%      49.0%      46.0%      26.0%
    Total Robots           88.0%      91.0%      70.0%      70.0%      97.0%      98.0%
  • 24. Image-to-HTML Ratio Had the Largest Effect on Detecting Robots
    In the results table from slide 23, the image-to-HTML ratio flagged more sessions and requests than any other single heuristic: IA 2012: 87.0% of sessions, 89.0% of requests; IA 2019: 66.0% of sessions, 56.0% of requests; PT 2019: 79.0% of sessions, 96.0% of requests.
  • 25. Total Number of Detected Bots After Applying All the Heuristics Together
    Total Robots row of the results table from slide 23, giving the percentage of sessions and requests detected after applying all the heuristics together: IA 2012: 88.0% of sessions, 91.0% of requests; IA 2019: 70.0% of sessions, 70.0% of requests; PT 2019: 97.0% of sessions, 98.0% of requests.
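One way to read "applying all the heuristics together" is that a session counts as a robot if any single heuristic flags it. The slides do not spell out the combination rule, so this union is an assumption, and the function `robot_share` and the flag dictionaries below are hypothetical.

```python
def robot_share(sessions):
    """Percentage of sessions flagged by at least one heuristic.

    `sessions` is a list of dicts mapping a heuristic name to a bool.
    Combining the heuristics with a logical OR is an assumption.
    """
    if not sessions:
        return 0.0
    flagged = sum(1 for flags in sessions if any(flags.values()))
    return 100.0 * flagged / len(sessions)


# Hypothetical per-session flags, not data from the study.
example_sessions = [
    {"known_bot": True, "image_ratio": True, "speed": False},
    {"known_bot": False, "image_ratio": True, "speed": False},
    {"known_bot": False, "image_ratio": False, "speed": True},
    {"known_bot": False, "image_ratio": False, "speed": False},
]
```

A union explains why the Total Robots percentage is never lower than that of any individual heuristic in the table, since overlapping detections are counted once.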
  • 26. Potential Reasons for the Increase in Human Sessions from 2012 to 2019
    In the Internet Archive samples, Total Robots fell from 88.0% of sessions (91.0% of requests) in 2012 to 70.0% of sessions (70.0% of requests) in 2019.
    ● Increase in awareness of web archives among human users in recent years
  • 27. Potential Reasons for the Increase in Human Sessions from 2012 to 2019
    ● Increase in popularity of headless browsers and automation tools such as Headless Chromium, PhantomJS, Selenium, and Puppeteer in recent years. Sessions driven by these tools may load pages much like a human-operated browser and so may not be flagged by the heuristics.
  • 28. In IA2012, the Robots Were Almost Exclusively Limited to Dip and Skim
  • 29. The Majority of Requests Are for Mementos Around the Time Each Access Log Sample Was Taken
  • 30. Key Takeaways
    ● The percentage of web archive accesses that were detected as robots:
        Out of requests: IA 2012: 91%, IA 2019: 70%, PT 2019: 98%
        Out of sessions: IA 2012: 88%, IA 2019: 70%, PT 2019: 97%
    ● In IA2012, the robots were almost exclusively limited to Dip and Skim, but in IA2019 they exhibit all of the patterns and their combinations.
    ● The majority of requests are for mementos that are close to the datetime of each log sample.
    Dataset/Feature    AlNoamany et al.    Our Study
    Sample Duration    30 minutes          24 hours
    Web Archives       Internet Archive    Internet Archive & Arquivo.pt
    Access Log Year    2012                2012 & 2019
    Himarsha R. Jayanetti, hjaya002@odu.edu, @HimarshaJ
  • 31. Backup slides …
  • 32. An Overview of Our Approach and the Steps Followed