To identify robots and humans and analyze their respective access patterns, we used the Internet Archive's (IA) Wayback Machine access logs from 2012 and 2019, as well as Arquivo.pt's (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate through the web archives, we evaluated these sessions to discover user access patterns. We compare detected robots vs. humans, user access patterns, and temporal preference across the two archives (IA vs. Arquivo.pt) and between the two years of IA access logs (2012 vs. 2019). The share of accesses detected as robots is greater in IA 2012 than in IA 2019 (21 percentage points higher in requests and 18 percentage points higher in sessions). Robots account for 98% of requests (97% of sessions) in Arquivo.pt (2019). We found that the robots are almost entirely limited to “Dip” and “Skim” access patterns in IA 2012, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.
Robots Still Outnumber Humans in Web Archives, But Less Than Before
Himarsha R. Jayanetti (1), Kritika Garg (1), Sawood Alam (2), Michael L. Nelson (1), and Michele C. Weigle (1)
(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk VA, USA (@WebSciDL)
(2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
Presented By:
Himarsha R. Jayanetti
Department of Computer Science
Old Dominion University, Norfolk, Virginia
@HimarshaJ @WebSciDL @oducs
TPDL ‘22, The 26th International Conference on Theory and Practice of Digital Libraries, Padua, Italy, 20 - 23 September 2022
2. Robots Still Outnumber Humans in Web Archives, But Less Than Before, TPDL ‘22 Padua, Italy. @HimarshaJ @WebSciDL
A Scenario in Which a Human Accesses Web Archives
https://web.archive.org/web/20120313134227/http://www.lib.odu.edu/exhibits/odu75thanniversary/norfolkdivision.htm
https://en.wikipedia.org/wiki/Old_Dominion_University#References
Another Scenario Is When Bot Services Query Web Archives
TimeMap Visualization Tool (TMVis): https://github.com/oduwsdl/tmvis
TMVis Visualizes How Individual Webpages Have Changed Over Time
[Figure: TimeMap, Slider, and GIF views for https://web.archive.org/web/20220000000000*/http://4genderjustice.org/]
Our Study Is an Extension of a Previous Study
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Indianapolis, IN, July 2013, pp. 339-348. https://doi.org/10.1145/2467696.2467722
● Robots outnumber humans:
○ 10:1 (sessions)
○ 5:4 (raw HTTP accesses)
○ 4:1 (megabytes transferred)
● Robots almost always access TimeMaps, but humans access
the mementos.
● No overall preference for mementos of a particular time, but
the recent past (within the last year) shows significant repeat
accesses.
● Proposed access patterns of web archive users:
○ Dip
○ Slide
○ Dive
○ Skim
Access Patterns of Web Archive Users: Dip and Dive
[Figure: Dip and Dive access patterns, drawn over the original resource (URI-R1) and the resources URI-R2 and URI-R3]
Access Patterns of Web Archive Users: Slide and Skim
We Used Three Full Day Access Log Datasets
Full-day samples of access logs:
● IA 2012: Internet Archive (https://web.archive.org/), February 2, 2012
● IA 2019: Internet Archive (https://web.archive.org/), February 7, 2019
● PT 2019: Arquivo.pt (https://arquivo.pt/), February 7, 2019

Feature               IA 2012       IA 2019        PT 2019
No. of Requests       99,173,542    308,194,916    1,046,855
GET                   98.80%        98.68%         97.92%
HEAD                  1.12%         0.84%          1.37%
Status Code 2xx       32.73%        48.26%         26.03%
Status Code 3xx       52.57%        42.74%         20.22%
Status Code 4xx       11.71%        8.79%          53.58%
Status Code 5xx       2.99%         0.20%          0.17%
Embedded Resources    43.62%        63.36%         19.68%
SI Bot                0.01%         0.15%          0.34%
We Followed a Two-Step Data Cleaning Process
Stage 1
● Remove log entries that were either invalid or irrelevant to the analysis.
○ Removed everything except requests to mementos and TimeMaps.
○ Kept the requests to the robots.txt of the web archive.
Stage 2
● Remove log entries that were irrelevant in terms of user behavior.
○ Everything except GET requests
○ Everything except 200, 404, and 503 response codes
○ Embedded resources

Dataset    Before (No. of Requests)    After Cleaning (No. of Requests)
                                       Stage 1     Stage 2
IA 2012    99,173,542                  85.22%      18.58%
IA 2019    308,194,916                 77.19%      11.36%
PT 2019    1,046,855                   86.40%      57.77%
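The Stage 2 rules can be read as a simple per-request filter. The following is a minimal sketch, not the authors' code: the `Request` record and its `is_embedded` flag are hypothetical, and how a request is recognized as an embedded resource is left outside the sketch.

```python
from dataclasses import dataclass

# Response codes kept in Stage 2, as listed on the slide.
KEPT_STATUS_CODES = {200, 404, 503}


@dataclass
class Request:
    """Hypothetical parsed log entry; field names are assumptions."""
    method: str
    status: int
    is_embedded: bool  # assumed to be set by an earlier parsing step


def keep_in_stage2(req: Request) -> bool:
    """Return True if the request survives the Stage 2 cleaning rules."""
    return (
        req.method == "GET"
        and req.status in KEPT_STATUS_CODES
        and not req.is_embedded
    )
```

For example, `keep_in_stage2(Request("GET", 302, False))` is `False`, because redirects are dropped in this stage.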
We Divided the Access Logs Into Different Sessions
○ Grouped the requests based on the IP and User-Agent.
○ Divided the requests of each user into individual sessions (timeout threshold: 10 minutes)
1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0
(Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
The IP addresses are anonymized.
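The session splitting described on this slide can be sketched as follows. This is an illustrative sketch under two assumptions: requests are already parsed into `(ip, user_agent, timestamp)` tuples, and the 10-minute threshold is applied to the gap between consecutive requests from the same user. The function name `sessionize` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)


def sessionize(requests):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    A "user" is an (ip, user_agent) pair. A new session starts whenever the
    gap between two consecutive requests of a user exceeds SESSION_TIMEOUT.
    Returns a list of ((ip, user_agent), [timestamps]) pairs.
    """
    by_user = defaultdict(list)
    for ip, user_agent, timestamp in requests:
        by_user[(ip, user_agent)].append(timestamp)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for previous, timestamp in zip(stamps, stamps[1:]):
            if timestamp - previous > SESSION_TIMEOUT:
                sessions.append((user, current))
                current = []
            current.append(timestamp)
        sessions.append((user, current))
    return sessions
```

Applied to the six timestamps in the log excerpt above (04:36:43 through 05:00:49), this yields two sessions, because more than 10 minutes pass between 04:36:44 and 04:58:34.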
The duration between the two requests is greater than 10 minutes, so they fall into different sessions.
Heuristic: The Type of Request (HEAD Requests)
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
● Web browsers issue GET requests for web pages.
● We flagged HEAD requests as coming from bots.
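A check for this heuristic could look like the sketch below. It assumes log lines in the quoted request format shown in the excerpt above; the regular expression and the function name `is_head_request` are illustrative, not taken from the authors' code.

```python
import re

# Matches the quoted request line, e.g. "HEAD /wayback/... HTTP/1.1".
REQUEST_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def is_head_request(log_line: str) -> bool:
    """Return True if the first request line found in log_line uses HEAD."""
    match = REQUEST_LINE.search(log_line)
    return match is not None and match.group("method") == "HEAD"
```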
Heuristic: List of Known Bots
● A manually compiled User-Agent list of known bots.
● User-Agents with keywords such as bot, crawler, spider, etc.
● The Python module "DeviceDetector", a User-Agent parser that helps determine whether or not a User-Agent is a bot.
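The list and keyword checks can be sketched with the standard library alone. This sketch does not call DeviceDetector; the keyword pattern covers only the three keywords named on the slide, and the `known_bots` argument stands in for the manually compiled list.

```python
import re

# Only the keywords named on the slide; the "etc." is not filled in here.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def looks_like_bot(user_agent: str, known_bots=frozenset()) -> bool:
    """Return True if the User-Agent is on the known-bot list or contains
    one of the bot keywords."""
    return user_agent in known_bots or bool(BOT_KEYWORDS.search(user_agent))
```

A User-Agent such as "RSS Scout 0.9.2" contains none of the keywords, so it is caught only if it appears in the list passed as `known_bots`.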
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
https://github.com/oduwsdl/access-patterns/blob/main/Known_Bot_List/knownbot.tsv (Final Known Bot List)
Heuristic: Number of User-Agents per IP
x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
00101000
x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 -
"http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
00101000
x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
. . .
. . .
. . .
x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)"
00101000
x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
00101000
x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php
HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
● Some of the bots keep changing their User-Agent between requests to avoid being detected as a bot.
● We have flagged requests from IPs that update their User-Agent field more than 20 times as bots.
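One way to approximate this heuristic is to count distinct User-Agent strings per IP. This is a sketch under that assumption: the slide speaks of an IP updating its User-Agent more than 20 times, and counting distinct values is a simplification of counting changes. The function name is hypothetical.

```python
from collections import defaultdict

UA_THRESHOLD = 20  # threshold stated on the slide


def ips_with_many_user_agents(requests, threshold=UA_THRESHOLD):
    """Return the set of IPs seen with more than `threshold` distinct
    User-Agent strings. `requests` is an iterable of (ip, user_agent)."""
    agents_by_ip = defaultdict(set)
    for ip, user_agent in requests:
        agents_by_ip[ip].add(user_agent)
    return {ip for ip, agents in agents_by_ip.items() if len(agents) > threshold}
```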
Heuristic: Requests to robots.txt File
● Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
● We considered a request for the robots.txt file as an indication of a bot.
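A check for this heuristic might look like the following sketch. It assumes the request target has already been extracted from the log line and that only the archive's own top-level robots.txt counts, not an archived copy of another site's robots.txt; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_robots_txt_request(target: str) -> bool:
    """Return True if the request target is the archive's own /robots.txt.

    Works for absolute targets (http://web.archive.org/robots.txt) and
    path-only targets (/robots.txt).
    """
    return urlsplit(target).path == "/robots.txt"
```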
0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside
HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout
0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385
"http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2"
00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416
"http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2"
00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET
http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2"
00001000
Heuristic: Image-to-HTML Ratio
● Image-to-HTML is the ratio between the number of image files and the number of HTML files per session.
● Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
● We flagged a session with less than one image file for every 10 HTML files as a robot session.
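The threshold above can be written as a small predicate. This is a sketch, assuming the image and HTML counts per session are already available; treating a session with no HTML files as not flagged is an assumption, since the slide does not cover that case.

```python
def is_robot_by_image_ratio(num_images: int, num_html: int) -> bool:
    """Flag a session with fewer than 1 image per 10 HTML files.

    Uses integer arithmetic (images * 10 < html) to avoid float rounding.
    Sessions with no HTML files are not flagged (an assumption).
    """
    if num_html == 0:
        return False
    return num_images * 10 < num_html
```

For example, a session with 100 HTML files and 5 images is flagged, while a session with 10 HTML files and 1 image sits exactly at the threshold and is not.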
[Figure: http://web.archive.org/web/20220512060725/https://www.odu.edu/ downloaded using cURL vs. accessed in the web browser]
Results of Applying the Heuristics Separately to Detect Bots
Heuristics             IA 2012                 IA 2019                 PT 2019
                       Sessions    Requests    Sessions    Requests    Sessions    Requests
                       (1.53M)     (22.3M)     (2.7M)      (42.9M)     (3.7k)      (614k)
Known Bots             1.0%        1.0%        12.0%       12.0%       24.0%       11.0%
#UA per IP             0.3%        3.0%        0.2%        3.4%        0.1%        0.4%
Robots.txt             0.1%        0.1%        0.4%        0.1%        11.0%       0.7%
Image to HTML ratio    87.0%       89.0%       66.0%       56.0%       79.0%       96.0%
Browsing Speed         16.0%       20.0%       19.0%       49.0%       46.0%       26.0%
Total Robots           88.0%       91.0%       70.0%       70.0%       97.0%       98.0%

Percentage of sessions and requests labeled as robots by each heuristic applied separately.
Image-to-HTML Ratio Had the Largest Effect on
Detecting Robots
Total Number of Detected Bots After Applying All the
Heuristics Together
The number of requests/sessions detected after applying all the heuristics together.
Potential Reasons for the Increase in Human Sessions From 2012 to 2019
Increase in awareness of web archives among human users in recent years
Increase in the popularity of headless browsers set up with tools such as Headless Chromium, PhantomJS, Selenium, and Puppeteer in recent years
In IA2012, the Robots Were Almost
Exclusively Limited To Dip and Skim
The Majority of Requests Are for Mementos Around
the Time Each Access Log Sample Was Taken.
Key Takeaways
● The percentage of web archive accesses that were detected as robots:
○ Out of requests: IA 2012: 91%, IA 2019: 70%, PT 2019: 98%
○ Out of sessions: IA 2012: 88%, IA 2019: 70%, PT 2019: 97%
● In IA 2012, the robots were almost exclusively limited to Dip and Skim, but in IA 2019 they exhibit all of the patterns and their combinations.
● The majority of the requests are for mementos that are close to the datetime of each log sample.

Dataset/Feature    AlNoamany et al.    Our Study
Sample Duration    30 minutes          24 hours
Web Archives       Internet Archive    Internet Archive & Arquivo.pt
Access Log Year    2012                2012 & 2019

Himarsha R. Jayanetti
hjaya002@odu.edu
@HimarshaJ
Backup slides …
An Overview of Our Approach and the Steps Followed