Access Patterns for Robots and Humans in Web Archives

Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Computer Science Department
Old Dominion University, Norfolk, VA
yasmin@cs.odu.edu

0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 … Safari/535.7"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 … Chrome/17.0.963.46 Safari/535.11"
0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) …
0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg,application/x-shockwave-flash,application/vnd.ms-excel,applicati"
0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
…

Motivation
• There have been many studies of web access patterns
• This is the first study to use the Internet Archive's web server logs to discover how users access web archives

Research Question
• How do users, both humans and robots,
access web archives?

Methodology

Sample of Wayback Machine access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1"
200 18875 "http://wayback.archive.org/web/*/http://www.cnn.com"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"

Fields of a Wayback Machine access log entry
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86 (IPs had been anonymized by Internet Archive)
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com (TimeMap)
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
The next request from the same client is for a memento:
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com/
HTTP/1.1" 200 18875
"http://wayback.archive.org/web/*/http://www.cnn.com"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
• URI: http://web.archive.org/web/20130318135600/http://www.cnn.com/ (Memento)
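
The field breakdown above can be sketched as a small parser. This is a minimal illustration, not the authors' tooling: it assumes each entry sits on one line in the common "combined" log layout shown in the sample, and the names `LOG_PATTERN` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method uri protocol" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

# The first sample entry from the slides, joined onto one line.
sample = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" '
    '200 96433 "http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)
record = parse_log_line(sample)
```

Entries whose User-Agent itself contains a double quote would not match this pattern and would need extra handling.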

Dataset
• The Wayback Machine receives more than 82 million requests per day
• Cluster sampling: one week, Feb. 2-8, 2012
• Random sampling: a random slice (2 million requests) from each day of the week
• We compared all seven days and found that the Feb. 2 sample is representative
– For details, see Section 4.2 and Table 3 in the paper

Pre-processing
• Data Cleaning
• Session Identification
• Robot Detection

Data Cleaning
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http://www.jcdl.org/
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png
HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/about:blank
HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/

Embedded Resources
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"

Static Resources
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png
HTTP/1.1" 200 3700 "-" "Mozilla/5.0"

Invalid requests
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/about:blank
HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"

Requests that had 3xx status code
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
HTTP/1.1 302 Moved Temporarily
Server: Tengine/1.4.3
Date: Tue, 02 Jul 2013 19:48:59 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive
set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
Location: /web/20130114160045/http://www.jcdl.org/
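
The four cleaning filters (embedded resources, static resources, invalid requests, 3xx responses) could be combined roughly as follows. This is only a sketch under stated assumptions: the slides show `im_` as the embedded-image marker and `staticweb.archive.org` as the static host, while treating any archived target that does not start with `http://` or `https://` as invalid is an assumption made here for illustration. `keep_request` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Memento URI: /web/<14-digit timestamp><optional two-letter modifier>_/<original URI>
MEMENTO_RE = re.compile(
    r'^https?://[^/]+/web/(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<target>.+)$'
)
STATIC_HOST = "staticweb.archive.org"  # host seen in the sample log

def keep_request(uri, status):
    """Return True if the request survives all four cleaning filters."""
    if 300 <= status < 400:                     # requests with a 3xx status code
        return False
    if urlsplit(uri).netloc == STATIC_HOST:     # static resources (toolbar images)
        return False
    match = MEMENTO_RE.match(uri)
    if match:
        if match.group("mod") == "im_":         # embedded image of an archived page
            return False
        # Assumption: a target without an http(s) scheme, such as about:blank, is invalid.
        if not match.group("target").startswith(("http://", "https://")):
            return False
    return True
```

Applied to the sample entries, only the first request (the archived http://www.jcdl.org/ page returned with status 200) would be kept.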

Session: set of web pages requested
by a particular user
[Figure: timeline of requests p1 to p5 with labeled gaps of 1, 3, 4, and 9 minutes]
Time between two requests ≤ 10 minutes

Session Identification
• Grouping: requests are grouped by IP address and User-Agent
• Timeout threshold: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
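
A minimal sketch of this session identification step, assuming requests are already parsed into `(ip, user_agent, time)` tuples. The function name `identify_sessions` and the example data are hypothetical; only the grouping key and the 10-minute timeout come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold stated on the slide

def identify_sessions(requests, timeout=TIMEOUT):
    """Group (ip, user_agent, time) tuples into per-user sessions.

    A user is an (ip, user_agent) pair; a gap longer than `timeout`
    between consecutive requests starts a new session.
    """
    by_user = defaultdict(list)
    for ip, user_agent, when in requests:
        by_user[(ip, user_agent)].append(when)

    sessions = []
    for user, times in by_user.items():
        times.sort()
        current = [times[0]]
        for previous, when in zip(times, times[1:]):
            if when - previous <= timeout:
                current.append(when)
            else:
                sessions.append((user, current))
                current = [when]
        sessions.append((user, current))
    return sessions

# Hypothetical requests from one user with gaps of 1, 4, 11, and 3 minutes.
base = datetime(2012, 2, 2, 7, 0, 0)
example = [("0.11.160.135", "Mozilla/5.0", base + timedelta(minutes=m))
           for m in (0, 1, 5, 16, 19)]
sessions = identify_sessions(example)
```

In this example the 11-minute gap exceeds the timeout, so the five requests split into one session of three requests and one of two.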

Robot Detection is a big challenge
"I'm not a robot"

Distinguishing Robots from Humans

User-Agent Check
0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET
http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/
HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
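
One way to sketch a User-Agent check is a pattern match against tokens that robots commonly declare. The token list below is an assumption chosen for illustration (only `Python-urllib` and `MJ12bot` appear in the slides), and `is_declared_robot` is a hypothetical name.

```python
import re

# Assumed, non-exhaustive list of substrings that self-identifying robots use.
ROBOT_TOKENS = re.compile(r"bot|crawl|spider|python-urllib|curl|wget", re.IGNORECASE)

def is_declared_robot(user_agent):
    """Return True if the User-Agent string declares a known robot token."""
    return bool(ROBOT_TOKENS.search(user_agent))
```

A check like this only catches robots that identify themselves, which is why the remaining heuristics are needed.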

Number of User-Agents per IP
• An IP that presents 20 or more different User-Agents is treated as a robot lying about its identity
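
The User-Agent count per IP can be sketched as below, assuming parsed `(ip, user_agent)` pairs as input. The threshold of 20 comes from the slide; the function name `lying_robot_ips` is hypothetical.

```python
from collections import defaultdict

UA_LIMIT = 20  # threshold stated on the slide

def lying_robot_ips(requests, limit=UA_LIMIT):
    """Return the set of IPs seen with at least `limit` distinct User-Agents."""
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, seen in agents.items() if len(seen) >= limit}
```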

Robots.txt file
• A session that contains a request for robots.txt is a robot
0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET
http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.1;
http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.devilscafe.in
HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.genie.co.il
HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
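
A sketch of the robots.txt rule, assuming a session is available as a list of requested URIs. The helper name `requests_robots_txt` is hypothetical, and the choice to match only the archive's own `/robots.txt` (not archived copies of other sites' robots.txt files) is an assumption.

```python
from urllib.parse import urlsplit

def requests_robots_txt(session_uris):
    """Return True if any request in the session asks the server for /robots.txt."""
    return any(urlsplit(uri).path == "/robots.txt" for uri in session_uris)

# The three URIs from the MJ12bot session shown above.
mj12_session = [
    "http://web.archive.org/robots.txt",
    "http://wayback.archive.org/web/*/http://www.devilscafe.in",
    "http://wayback.archive.org/web/*/http://www.genie.co.il",
]
```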

6 requests in 2 seconds → robot
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-"
0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET
http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-"

3 requests in 480 seconds
(8 minutes) → human
0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200
566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433
"http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8)

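0.5 is a Good Browsing Speed Threshold
for Distinguishing Robots and Humans (Nithya
et al. 2012, Reddy et al. 2012)
Browsing Speed (BS)
\[ BS = \frac{\text{session length}}{\text{session duration}} \]
\[ \text{user} = \begin{cases} \text{human} & \text{if } BS \le 0.5 \\ \text{robot} & \text{if } BS > 0.5 \end{cases} \]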
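
The browsing speed rule can be sketched as follows. The units are an assumption inferred from the two example sessions: session length is taken as the number of requests and session duration as seconds between the first and last request. Returning "unknown" for zero-duration sessions is also an assumption, since the slides do not say how such sessions are handled.

```python
from datetime import datetime

THRESHOLD = 0.5  # requests per second (assumed units)

def browsing_speed(times):
    """Requests per second over the session, or None if the duration is zero."""
    duration = (max(times) - min(times)).total_seconds()
    if duration == 0:
        return None
    return len(times) / duration

def classify(times, threshold=THRESHOLD):
    speed = browsing_speed(times)
    if speed is None:
        return "unknown"  # left to the other heuristics
    return "robot" if speed > threshold else "human"

# Timestamps of the two example sessions shown on the earlier slides.
robot_times = [datetime(2012, 2, 2, 7, 0, s) for s in (1, 1, 2, 2, 2, 3)]
human_times = [datetime(2012, 2, 2, 7, 0, 0),
               datetime(2012, 2, 2, 7, 3, 46),
               datetime(2012, 2, 2, 7, 8, 0)]
```

With these timestamps the first session has a speed of 6 / 2 = 3.0 and is labeled a robot, while the second has 3 / 480 and is labeled a human.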

Image-to-HTML Ratio
"If I download these, I'm not a robot"
• The ratio between the number of image files and the number of HTML files per session
• Robot sessions have an image-to-HTML ratio of less than 1:10, as suggested by Stassopoulou et al. 2005
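
A rough sketch of the image-to-HTML ratio test. It assumes each session is given as a list of original (pre-archive) URLs with a scheme, and it classifies files by extension; both extension sets are illustrative assumptions, not the paper's definition. The function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

IMAGE_EXT = {".gif", ".jpg", ".jpeg", ".png"}  # assumed image types
HTML_EXT = {"", ".html", ".htm"}               # assumed HTML types ("" covers paths like "/")
RATIO_LIMIT = 1 / 10                           # 1:10, from the slide

def image_to_html_ratio(urls):
    """Return images / HTML pages for a session, or None if it has no HTML page."""
    extensions = [posixpath.splitext(urlsplit(u).path)[1].lower() for u in urls]
    images = sum(1 for ext in extensions if ext in IMAGE_EXT)
    pages = sum(1 for ext in extensions if ext in HTML_EXT)
    if pages == 0:
        return None
    return images / pages

def looks_like_robot(urls, limit=RATIO_LIMIT):
    ratio = image_to_html_ratio(urls)
    return ratio is not None and ratio < limit
```

The intuition is that a browser fetches the images embedded in each page, while many crawlers fetch only the HTML.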

Image-to-HTML ratio is the best at
detecting robots

Traffic Analysis
• Records remaining after cleaning: 21.3%
(426,317 out of 2M)
• Unique IPs: 21,932
• Users: 33,841
• Sessions: 37,634

Robots have longer sessions than humans

Humans spend more time than robots

Robots outnumber humans in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1

User Access Patterns in Web Archives
• Dip
• Dive
• Slide
• Skim

Dip: simple access to a TimeMap or memento
[Figure: a TimeMap and a memento]

Dive: different pages at approximately the same archive time
Archive times of the example mementos:
November 12, 2009 11:55:54
November 12, 2009 05:37:22
November 12, 2009 05:38:02

Slide: the same page at different archive times
Archive times of the example mementos:
March 18, 2013 13:56:00
November 15, 2009 05:33:01
July 31, 2006 23:55:45

Skim: lists of TimeMaps
http://web.archive.org/web/*/http://cnn.com/
http://web.archive.org/web/*/http://www.bbc.com/
http://web.archive.org/web/*/http://www.nytimes.com/

Everybody Dips, Humans Dive, Robots Skim
[Figure: access patterns for robots (34,203 sessions) and humans (3,431 sessions)]

Pattern Length
Slide length = 4
Skim length = 3
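
The four patterns can be illustrated with a simplified labeler that looks at consecutive requests in a session. This is a sketch of the definitions given on the slides, not the detection procedure used in the paper: the one-day tolerance for "approximately the same archive time" is an assumption, as are the names `parse_wayback` and `label_session`.

```python
import re
from datetime import datetime, timedelta

# Memento: /web/<14-digit timestamp>[modifier]/<target>; TimeMap: /web/<digits>*/<target>
WAYBACK_RE = re.compile(
    r'^https?://[^/]+/web/(?:(?P<ts>\d{14})(?:[a-z]{2}_)?|\d*\*)/(?P<target>.+)$'
)

def parse_wayback(uri):
    """Return (archive_time or None for a TimeMap, target), or None if unparseable."""
    match = WAYBACK_RE.match(uri)
    if match is None:
        return None
    when = None
    if match.group("ts"):
        try:
            when = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
        except ValueError:
            return None
    return when, match.group("target").rstrip("/")

def label_session(uris, tolerance=timedelta(days=1)):
    """Label a one-request session as a dip, else label each consecutive pair."""
    requests = [r for r in map(parse_wayback, uris) if r is not None]
    if len(requests) == 1:
        return ["dip"]
    labels = []
    for (t1, u1), (t2, u2) in zip(requests, requests[1:]):
        if t1 is None and t2 is None and u1 != u2:
            labels.append("skim")    # one TimeMap after another
        elif t1 and t2 and u1 == u2 and t1 != t2:
            labels.append("slide")   # same page, different archive times
        elif t1 and t2 and u1 != u2 and abs(t1 - t2) <= tolerance:
            labels.append("dive")    # different pages, close archive times
        else:
            labels.append("other")
    return labels
```

For instance, three consecutive TimeMap requests for different sites yield two "skim" labels, and three mementos of the same page at different archive times yield two "slide" labels.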

Small Medians, Large Standard Deviations

Only recent past exhibits
locality of reference
• Cache replacement policies should favor the recent past

Conclusions
• We introduced traffic analysis for the Wayback Machine
• We discovered that robots outnumber humans
– 10:1 in terms of sessions
– 5:4 in terms of raw, unfiltered requests
– 4:1 in terms of megabytes transferred
– Robots need APIs http://arxiv.org/abs/1305.5959
• We identified four major web archive access patterns
– Dip
– Slide
– Dive
– Skim
• Only the recent past exhibits locality of reference

Extra Slides

The Features of the Samples
Days       Feb 2    Feb 3    Feb 4    Feb 5    Feb 6    Feb 7    Feb 8    Mean     SD       SE
Duration   0:33:12  0:31:15  0:40:34  0:42:57  0:29:35  0:25:45  0:24:33  0:32:33  0:06:29  0:02:27
GET        98.4%    99.3%    97.7%    97.9%    99.4%    99.7%    99.8%    99%      0.8%     0.3%
Embedded   47.4%    34.8%    43.7%    42.7%    41.9%    44.7%    46.8%    43.1%    3.9%     1.5%
SI Robots  6.2%     12.0%    7.7%     7.7%     2.9%     3.5%     3.8%     6.3%     3.0%     1.1%
NullRef    42.6%    56.6%    47.5%    47.0%    49.4%    42.6%    43.9%    47.1%    4.6%     1.7%
s2xx       33.7%    32.4%    34.2%    33.2%    34.1%    33.4%    33.6%    33.5%    0.6%     0.2%
s3xx       51.8%    52.3%    50.8%    52.2%    51.7%    51.9%    53.2%    52.0%    0.7%     0.3%
s4xx       11.7%    13.1%    12.0%    11.6%    11.2%    10.3%    10.1%    11.4%    0.9%     0.4%
s5xx       2.8%     2.3%     3.0%     2.9%     3.0%     4.4%     3.1%     3.1%     0.6%     0.2%
Cleaned    21.3%    23.0%    17.6%    17.7%    20.7%    18.1%    16.9%    19.3%    2.2%     0.8%
Sessions   37,634   31,731   32,159   28,750   36,087   35,848   32,117   33,475   2,896    1,094
• Very small standard errors among samples
• The Feb. 2, 2012 sample is representative
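
As a worked check of the Mean, SD, and SE columns, the Sessions row can be recomputed from the seven daily values. The formulas are inferred from the numbers rather than stated on the slides: the reported SD matches a population standard deviation, and the reported SE matches SD divided by the square root of the number of days.

```python
import math
import statistics

# Sessions per day, Feb 2 through Feb 8, copied from the table.
sessions = [37634, 31731, 32159, 28750, 36087, 35848, 32117]

mean = statistics.mean(sessions)
sd = statistics.pstdev(sessions)     # population SD (inferred from the reported value)
se = sd / math.sqrt(len(sessions))   # standard error of the mean

print(round(mean), round(sd), round(se))
```

Rounded to whole sessions, these reproduce the table's 33,475, 2,896, and 1,094.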

Results of Data Cleaning
• The records remaining after cleaning are 21.3% of the requests in the raw file.

Robots outnumber humans in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
Users   # Sessions      # Requests (Raw)   # Transferred MB
Robots  34,203 (90.9%)  1,002,573 (50.1%)  20,010
Humans  3,431 (9.10%)   810,049 (40.5%)    4,459
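
The headline ratios follow from the counts in this table by simple division. The short check below uses only the numbers shown on the slide; the variable names are illustrative.

```python
# Counts copied from the table: sessions, raw requests, and transferred MB.
robots = {"sessions": 34203, "requests": 1002573, "mb": 20010}
humans = {"sessions": 3431, "requests": 810049, "mb": 4459}

# Robot-to-human ratio for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}

for key, value in ratios.items():
    print(f"{key}: {value:.2f} to 1")
```

The session ratio comes out just under 10, the raw request ratio is close to 5:4, and the transferred-megabyte ratio is roughly 4.5, which the slides report as 4:1.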

Humans exhibit Dip and Dive,
while robots exhibit Dip and Skim
[Figure labels: Robots, Humans; 328 Slides, 571 Dives, 1167 Slides, 1942 Dives]

The total number of mementos available for 2011 was similar to previous years.
