Access Patterns for Robots
and Humans in Web Archives
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Computer Science Department
Old Dominion University, Norfolk, VA
yasmin@cs.odu.edu
Access Patterns for Robots and Humans in Web Archives
0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
…
Motivation
• There have been many studies of web access
patterns
• This is the first study to use the Internet Archive’s
web server logs to discover how users access
web archives
Research Question
• How do users, both humans and robots,
access web archives?
Methodology
Sample of Wayback Machine
access logs
TimeMap request:
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
Memento request:
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com/
HTTP/1.1" 200 18875
"http://wayback.archive.org/web/*/http://www.cnn.com"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
Fields of the TimeMap request:
• Client IP: 0.247.222.86 (IPs had been anonymized by the Internet Archive)
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
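The fields listed above can be pulled out of a log line with a regular expression. The following is a minimal sketch, not the authors' parser: it assumes the entries follow the Combined Log Format seen in the sample, and the names `LOG_PATTERN` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: IP, two unused fields, [time], "METHOD URI PROTOCOL",
# status, bytes, "referrer", "user agent" (as in the sample entry).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 '
    '"http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)
record = parse_log_line(sample)
print(record["ip"], record["method"], record["uri"], record["status"])
```

Lines that deviate from this layout return `None` rather than raising, so malformed entries can be counted and inspected separately.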
Dataset
• More than 82 million requests per day come
to the Wayback Machine
• Cluster Sampling: a week, Feb. 2-8, 2012
• Random Sampling: random slice (2 million
requests) from each day of the week
• We looked at all these days and found that Feb. 2
is a representative sample
– For details, see Section 4.2 and Table 3 in the
paper
Pre-processing
• Data Cleaning
• Session Identification
• Robot Detection
Data Cleaning
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http://www.jcdl.org/
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png
HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/about:blank
HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http://www.jcdl.org/
HTTP/1.1" 302 0 "-" "Mozilla/5.0"
Request types handled during cleaning, for the page
http://web.archive.org/web/20070519015308/http://www.jcdl.org/:
• Embedded Resources
• Static Resources
• Invalid requests, e.g. http://web.archive.org/web/20100102003557/about:blank
• Requests that had 3xx status code, e.g.
http://web.archive.org/web/20140004100000/http://www.jcdl.org/ redirects to
http://web.archive.org/web/20130114160045/http://www.jcdl.org/
curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
HTTP/1.1 302 Moved Temporarily
Server: Tengine/1.4.3
Date: Tue, 02 Jul 2013 19:48:59 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive
set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
Location: /web/20130114160045/http://www.jcdl.org/
http://web.archive.org/web/20130114160045/
http://www.jcdl.org/
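As an illustration of the cleaning step, the sketch below filters out the four request types named on these slides. It is an assumption-laden sketch rather than the authors' rules: it assumes these types are dropped, that embedded resources can be recognized by a two-letter Wayback modifier such as `im_` after the timestamp, that static resources come from the `staticweb.archive.org` host, and that invalid requests target `about:` URIs. The names `keep_request`, `EMBEDDED` and `INVALID_TARGET` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: a modifier such as "im_" after the 14-digit timestamp marks an embedded resource.
EMBEDDED = re.compile(r"/web/\d{14}[a-z]{2}_/")
# Assumption: requests whose archived target is an "about:" URI are invalid.
INVALID_TARGET = re.compile(r"/web/[\d*]+/about:")

def keep_request(uri, status):
    """Return True if the request survives cleaning under the assumptions above."""
    if 300 <= status < 400:                                # 3xx redirects
        return False
    if urlsplit(uri).hostname == "staticweb.archive.org":  # Wayback static resources
        return False
    if EMBEDDED.search(uri):                               # embedded resources
        return False
    if INVALID_TARGET.search(uri):                         # invalid requests
        return False
    return True

sample = [
    ("http://web.archive.org/web/20070519015308/http://www.jcdl.org/", 200),
    ("http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg", 200),
    ("http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png", 200),
    ("http://web.archive.org/web/20100102003557/about:blank", 302),
    ("http://web.archive.org/web/20140004100000/http://www.jcdl.org/", 302),
]
kept = [uri for uri, status in sample if keep_request(uri, status)]
print(kept)
```

Under these assumptions only the first request, the page itself, remains.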
Session: set of web pages requested
by a particular user
[Figure: pages p1, p2, p3 and p4, p5, with gaps labeled 1 mins, 4 mins, 3 mins, 9 mins]
Time between two requests ≤ 10 minutes
Session Identification
• Grouping: based on the IP and User-Agent
• Threshold timeout: 10 minutes (Liu et al. 2007;
Spiliopoulou et al. 2003)
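The grouping and timeout rules above can be sketched as follows. This is a minimal illustration, assuming requests arrive as `(ip, user_agent, time, uri)` tuples; the function name `identify_sessions` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold from the slide

def identify_sessions(requests, timeout=TIMEOUT):
    """Group requests by (IP, User-Agent) and split on gaps longer than the timeout."""
    by_user = defaultdict(list)
    for ip, user_agent, time, uri in requests:
        by_user[(ip, user_agent)].append((time, uri))

    sessions = []
    for hits in by_user.values():
        hits.sort(key=lambda hit: hit[0])
        current = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - current[-1][0] <= timeout:
                current.append(hit)       # same session
            else:
                sessions.append(current)  # gap too long: close the session
                current = [hit]
        sessions.append(current)
    return sessions

base = datetime(2012, 2, 2, 7, 0, 0)
requests = [
    ("0.0.0.1", "UA-1", base, "p1"),
    ("0.0.0.1", "UA-1", base + timedelta(minutes=1), "p2"),
    ("0.0.0.1", "UA-1", base + timedelta(minutes=5), "p3"),
    ("0.0.0.1", "UA-1", base + timedelta(minutes=17), "p4"),
    ("0.0.0.1", "UA-1", base + timedelta(minutes=20), "p5"),
    ("0.0.0.1", "UA-2", base + timedelta(minutes=2), "q1"),
]
sessions = identify_sessions(requests)
print([[uri for _, uri in session] for session in sessions])
```

In this made-up example the 12-minute gap between p3 and p4 starts a new session, and the second User-Agent on the same IP is treated as a separate user.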
Robot Detection is a big challenge
I’m not a robot
Distinguishing Robots from
Humans
User-Agent Check
0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET
http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/
HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
Number of User-Agents per IP
One IP with ≥ 20 User-Agents = lying robot
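Both User-Agent heuristics can be sketched in a few lines. This is an illustrative sketch only: the token list in `ROBOT_TOKENS` is a hypothetical example, not the list used in the study, and the function names are made up. The threshold of 20 User-Agents per IP comes from the slide.

```python
from collections import defaultdict

UA_THRESHOLD = 20  # from the slide: one IP with >= 20 User-Agents
# Hypothetical tokens; a real deployment would use a maintained robot list.
ROBOT_TOKENS = ("bot", "crawler", "spider", "python-urllib")

def declares_robot(user_agent):
    """User-Agent check: the client identifies itself as a robot."""
    lowered = user_agent.lower()
    return any(token in lowered for token in ROBOT_TOKENS)

def find_multi_agent_ips(requests, threshold=UA_THRESHOLD):
    """Return the IPs that appear with at least `threshold` distinct User-Agents."""
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}

requests = [("0.0.0.1", f"Agent/{n}") for n in range(20)]
requests += [("0.0.0.2", "Agent/1"), ("0.0.0.2", "Agent/2")]
suspects = find_multi_agent_ips(requests)
print(suspects, declares_robot("Python-urllib/1.17"))
```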
Robots.txt file
• A session that contains a request for robots.txt is
a robot session
0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET
http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.1;
http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.devilscafe.in
HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.genie.co.il
HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
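A minimal sketch of this check, assuming a session is available as a list of requested URIs; the function name `requests_robots_txt` is hypothetical.

```python
from urllib.parse import urlsplit

def requests_robots_txt(session_uris):
    """Return True if any request in the session asks for the archive's own /robots.txt."""
    return any(urlsplit(uri).path == "/robots.txt" for uri in session_uris)

robot_session = [
    "http://web.archive.org/robots.txt",
    "http://wayback.archive.org/web/*/http://www.devilscafe.in",
    "http://wayback.archive.org/web/*/http://www.genie.co.il",
]
print(requests_robots_txt(robot_session))
```

Comparing the whole path, rather than searching for the substring, means an archived copy of some other site's robots.txt does not mark the session as a robot.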
6 requests, 2 seconds → robot
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET
http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
3 requests, 520 seconds
(about 9 minutes) → human
0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200
566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433
"http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8)
0.5 is a Good Browsing Speed Threshold
for Distinguishing Robots and Humans (Nithya
et al. 2012, Reddy et al. 2012)
Browsing Speed (BS)
\[ BS = \frac{\text{session length}}{\text{session duration}} \]
\[ BS \le 0.5 \Rightarrow \text{Humans}, \qquad BS > 0.5 \Rightarrow \text{Robots} \]
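The browsing-speed rule can be tried on the timestamps of the two example sessions. This sketch assumes that session length is the number of requests and that session duration is measured in seconds between the first and last request; the slides do not state the units. The handling of zero-duration sessions is also an assumption, and the function names are hypothetical.

```python
from datetime import datetime

BS_THRESHOLD = 0.5  # threshold from the slide
LOG_TIME = "%d/%b/%Y:%H:%M:%S %z"

def browsing_speed(times):
    """Requests per second over the session (assumed units)."""
    duration = (max(times) - min(times)).total_seconds()
    if duration == 0:
        # Assumption: several requests in the same second count as infinitely fast;
        # a single-request session gets speed 0.
        return float("inf") if len(times) > 1 else 0.0
    return len(times) / duration

def classify(times):
    return "robot" if browsing_speed(times) > BS_THRESHOLD else "human"

robot_times = [datetime.strptime(t, LOG_TIME) for t in (
    "02/Feb/2012:07:00:01 +0000", "02/Feb/2012:07:00:01 +0000",
    "02/Feb/2012:07:00:02 +0000", "02/Feb/2012:07:00:02 +0000",
    "02/Feb/2012:07:00:02 +0000", "02/Feb/2012:07:00:03 +0000",
)]
human_times = [datetime.strptime(t, LOG_TIME) for t in (
    "02/Feb/2012:07:00:00 +0000", "02/Feb/2012:07:03:46 +0000",
    "02/Feb/2012:07:08:00 +0000",
)]
print(classify(robot_times), classify(human_times))
```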
Image-to-HTML Ratio
If I download these, I’m not a robot
• The ratio between the number of image files
and the number of HTML files per session
• Robot sessions have an image-to-HTML ratio below
1:10, as suggested by Stassopoulou et al. 2005
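A rough sketch of the ratio test follows. The slides do not say how files are typed, so this example assumes that images are recognized by file extension, that scripts and stylesheets are ignored, and that every other request counts as an HTML page. The extension sets, the function names and the sample URIs other than the jcdl.org ones shown earlier are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

RATIO_THRESHOLD = 1 / 10  # from the slide: robots fall below 1:10
# Assumed extension lists, for illustration only.
IMAGE_EXT = {".gif", ".jpg", ".jpeg", ".png"}
NON_PAGE_EXT = IMAGE_EXT | {".js", ".css", ".ico", ".swf"}

def extension(uri):
    return posixpath.splitext(urlsplit(uri).path)[1].lower()

def image_to_html_ratio(uris):
    images = sum(1 for uri in uris if extension(uri) in IMAGE_EXT)
    pages = sum(1 for uri in uris if extension(uri) not in NON_PAGE_EXT)
    if pages == 0:
        # Assumption: a session with images but no pages has an unbounded ratio.
        return float("inf") if images else 0.0
    return images / pages

def classify(uris):
    return "robot" if image_to_html_ratio(uris) < RATIO_THRESHOLD else "human"

prefix = "http://web.archive.org/web/20070519015308"
robot_session = [f"{prefix}/http://www.jcdl.org/page{i}.html" for i in range(20)]
robot_session.append(f"{prefix}im_/http://www.jcdl.org/images/jcdl2007-edie.jpg")
human_session = [
    f"{prefix}/http://www.jcdl.org/",
    f"{prefix}im_/http://www.jcdl.org/images/jcdl2007-edie.jpg",
    f"{prefix}im_/http://www.jcdl.org/images/a.gif",
    f"{prefix}im_/http://www.jcdl.org/images/b.png",
]
print(classify(robot_session), classify(human_session))
```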
Image-to-HTML ratio is the best at
detecting robots
Traffic Analysis
• Records remaining after cleaning: 21.3%
(426,317 out of 2M)
• Unique IPs: 21,932
• Users: 33,841
• Sessions: 37,634
Robots have longer sessions
than humans
Humans spend more time
than Robots
Robots outnumber humans
in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
User Access Patterns in
Web Archives
• Dip
• Dive
• Slide
• Skim
Dip: simple access to a
TimeMap or memento
[Figure panels: TimeMap, Memento]
Dive: different pages at approximately
the same archive time
[Figure: three mementos archived on November 12, 2009 at 11:55:54, 05:37:22, and 05:38:02]
Slide: the same page at different
archive times
[Figure: mementos archived on March 18, 2013 13:56:00, November 15, 2009 05:33:01, and July 31, 2006 23:55:45]
Skim: lists of TimeMaps
http://web.archive.org/web/*/http://cnn.com/
http://web.archive.org/web/*/http://www.bbc.com/
http://web.archive.org/web/*/http://www.nytimes.com/
Everybody Dips, Humans Dive,
Robots Skim
[Figure panels: Robots (34,203 sessions), Humans (3,431 sessions)]
Pattern Length
Slide length = 4
Skim length = 3
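The four patterns and their lengths can be sketched as a small classifier over the request URIs of one session. This is only one reading of the slide definitions. It assumes Wayback-style URIs of the form `/web/<timestamp or *>/<original URI>`, treats "approximately the same archive time" as within one day (a hypothetical window), and counts pattern length as the number of consecutive requests in a run. The names `parse`, `step`, and `patterns` and the `a.example` / `b.example` hosts are made up for the example.

```python
import re
from datetime import datetime, timedelta

# /web/<14-digit timestamp or *><optional flag such as im_>/<original URI>
WAYBACK = re.compile(r"^https?://[^/]+/web/(\*|\d{14})[a-z_]*/(.+)$")


def parse(uri):
    """Return (archive time, or None for a TimeMap; original URI), or None."""
    match = WAYBACK.match(uri)
    if match is None:
        return None
    stamp, target = match.groups()
    when = None if stamp == "*" else datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return when, target.rstrip("/")


def step(prev, cur, window):
    """Name the pattern formed by two consecutive requests, if any."""
    (t1, u1), (t2, u2) = prev, cur
    if t1 is None and t2 is None:          # two TimeMaps in a row
        return "skim" if u1 != u2 else None
    if t1 is None or t2 is None:           # TimeMap next to a memento
        return None
    if u1 == u2:                           # same page, different archive time
        return "slide" if t1 != t2 else None
    return "dive" if abs(t1 - t2) <= window else None


def patterns(uris, window=timedelta(days=1)):
    """Return (pattern, length) runs for one session's request URIs."""
    reqs = [p for p in map(parse, uris) if p is not None]
    if len(reqs) == 1:
        return [("dip", 1)]
    runs, kind, length = [], None, 0
    for prev, cur in zip(reqs, reqs[1:]):
        k = step(prev, cur, window)
        if k is not None and k == kind:
            length += 1                    # the current run continues
        else:
            if kind is not None:
                runs.append((kind, length))
            kind, length = k, 2            # a new run starts with two requests
    if kind is not None:
        runs.append((kind, length))
    return runs


wb = "http://web.archive.org/web/"
skim = [wb + "*/http://cnn.com/", wb + "*/http://www.bbc.com/",
        wb + "*/http://www.nytimes.com/"]
slide = [wb + "20130318135600/http://www.cnn.com",
         wb + "20091115053301/http://www.cnn.com",
         wb + "20060731235545/http://www.cnn.com"]
dive = [wb + "20091112115554/http://a.example/",
        wb + "20091112053722/http://b.example/"]
print(patterns(skim), patterns(slide), patterns(dive), patterns(skim[:1]))
```

Timestamps that are not valid dates would make `datetime.strptime` raise `ValueError`; the sketch assumes such requests were already dropped during data cleaning.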
Small Medians, Large
Standard Deviations
Only recent past exhibits
locality of reference
Cache replacement policies should favor the recent past
Conclusions
• We introduced traffic analysis for the Wayback Machine
• We discovered that robots outnumber humans
– 10:1 in terms of sessions
– 5:4 in terms of raw, unfiltered requests
– 4:1 in terms of megabytes transferred
– Robots need APIs http://arxiv.org/abs/1305.5959
• We identified four major web archive access patterns
– Dip
– Slide
– Dive
– Skim
• Only recent past exhibits locality of reference
Extra Slides
The Features of the Samples
Days Feb 2 Feb 3 Feb 4 Feb 5 Feb 6 Feb 7 Feb 8 Mean SD SE
Duration 0:33:12 0:31:15 0:40:34 0:42:57 0:29:35 0:25:45 0:24:33 0:32:33 0:06:29 0:02:27
GET 98.4% 99.3% 97.7% 97.9% 99.4% 99.7% 99.8% 99% 0.8% 0.3%
Embedded 47.4% 34.8% 43.7% 42.7% 41.9% 44.7% 46.8% 43.1% 3.9% 1.5%
SI Robots 6.2% 12.0% 7.7% 7.7% 2.9% 3.5% 3.8% 6.3% 3.0% 1.1%
NullRef 42.6% 56.6% 47.5% 47.0% 49.4% 42.6% 43.9% 47.1% 4.6% 1.7%
s2xx 33.7% 32.4% 34.2% 33.2% 34.1% 33.4% 33.6% 33.5% 0.6% 0.2%
s3xx 51.8% 52.3% 50.8% 52.2% 51.7% 51.9% 53.2% 52.0% 0.7% 0.3%
s4xx 11.7% 13.1% 12.0% 11.6% 11.2% 10.3% 10.1% 11.4% 0.9% 0.4%
s5xx 2.8% 2.3% 3.0% 2.9% 3.0% 4.4% 3.1% 3.1% 0.6% 0.2%
Cleaned 21.3% 23.0% 17.6% 17.7% 20.7% 18.1% 16.9% 19.3% 2.2% 0.8%
Sessions 37,634 31,731 32,159 28,750 36,087 35,848 32,117 33,475 2,896 1,094
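The Mean, SD, and SE columns can be checked against the daily values. The sketch below does this for the Sessions row, assuming that SD is the population standard deviation (dividing by n) and that SE = SD / √n with n = 7 days. The table does not state these formulas, so this is an inference from the numbers rather than the authors' stated method.

```python
import math
import statistics

# Sessions per day, Feb 2 to Feb 8, copied from the table above.
sessions = [37634, 31731, 32159, 28750, 36087, 35848, 32117]

mean = statistics.mean(sessions)
sd = statistics.pstdev(sessions)        # population SD (assumed)
se = sd / math.sqrt(len(sessions))      # standard error of the mean

print(round(mean), round(sd), round(se))
```

Under these assumptions the rounded results are 33475, 2896, and 1094, matching the table. The sample standard deviation (dividing by n - 1) would give a larger SD that does not match.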
Very Small Standard Errors among
Samples
Feb. 2, 2012 sample is representative
Results of Data Cleaning
• The records remaining after cleaning make up 21.3%
of the requests in the raw file.
Robots outnumber humans
in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1

Users     # Sessions        # Requests (Raw)     # Transferred MB
Robots    34,203 (90.9%)    1,002,573 (50.1%)    20,010
Humans    3,431 (9.10%)     810,049 (40.5%)      4,459
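The headline ratios follow from the counts in this table. The short calculation below divides each robot figure by the corresponding human figure; the dictionary keys are names chosen for the example.

```python
# Counts copied from the robots/humans table.
robots = {"sessions": 34203, "requests": 1002573, "mb": 20010}
humans = {"sessions": 3431, "requests": 810049, "mb": 4459}

# Robot-to-human ratio for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}
for key, value in ratios.items():
    print(f"{key}: {value:.2f} to 1")
```

The results are roughly 9.97 for sessions (about 10:1), 1.24 for raw requests (about 5:4), and 4.49 for megabytes transferred, so the 4:1 figure appears to be a rounded-down summary.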
Humans exhibit Dip and Dive,
while robots exhibit Dip and Skim
[Figure panels: Robots, Humans; values shown: 328 Slides, 571 Dives, 1167 Slides, 1942 Dives]
The total number of mementos available
for 2011 was similar to previous years.

Access Patterns for Robots and Humans in Web Archives

  • 1.
    Access Patterns forRobots and Humans in Web Archives Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson Computer Science Department Old Dominion University, Norfolk, VA yasmin@cs.odu.edu Access Patterns for Robots and Humans in Web Archives
  • 2.
    Access Patterns forRobots and Humans in Web Archives 2 0.204.48.255 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/ahrefs.comHTTP/1.0"200 96037 "-" "Mozilla/5.0(Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0" 0.241.150.135 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/hperlinknow.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101Firefox/9.0.1" 0.227.253.171 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.jsHTTP/1.1"403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.62.96.215 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26)Gecko/20120128BTRS87692Firefox/3.6.26( .NET CLR 3.5.30729; .NET4.0E)" 0.55.251.218 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gifHTTP/1.1"302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html""Mozilla/5.0(compatible;MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" 0.123.255.46 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gifHTTP/1.1"302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77Safari/535.7" 0.73.170.52 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/http://www.pornhub.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22(KHTML, like Gecko) Version/5.1.1Safari/534.51.22" 0.172.74.45 - - [02/Feb/2012:06:57:19+0000] "GET 
http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gifHTTP/1.1"302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11(KHTML, like Gecko) Chrome/17.0.963.46Safari/535.11" 0.227.253.171 - - [02/Feb/2012:06:57:19+0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpgHTTP/1.1"200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.172.74.45 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gifHTTP/1.1"302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11(KHTML, like Gecko) Chrome/17.0.963.46Safari/535.11" 0.227.26.32 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.orgHTTP/1.1"200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101Firefox/9.0.1" 0.29.194.93 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.netHTTP/1.1"200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25)Gecko/20111212 AlexaToolbar/alxf-2.13Firefox/3.6.25( .NET CLR 3.5.30729)" 0.90.22.18 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.comHTTP/1.1"200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18)Gecko/20110614Firefox/3.6.18" 0.7.73.16 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02HTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15)Gecko/20110303Firefox/3.6.15& 
vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati" 0.49.73.161 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpgHTTP/1.1"302 0 "-" "Mozilla/5.0" …
  • 3.
    Access Patterns forRobots and Humans in Web Archives 3 0.204.48.255 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/ahrefs.comHTTP/1.0"200 96037 "-" "Mozilla/5.0(Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0" 0.241.150.135 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/hperlinknow.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101Firefox/9.0.1" 0.227.253.171 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.jsHTTP/1.1"403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.62.96.215 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26)Gecko/20120128BTRS87692Firefox/3.6.26( .NET CLR 3.5.30729; .NET4.0E)" 0.55.251.218 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gifHTTP/1.1"302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html""Mozilla/5.0(compatible;MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" 0.123.255.46 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gifHTTP/1.1"302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77Safari/535.7" 0.73.170.52 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/*/http://www.pornhub.comHTTP/1.1"302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22(KHTML, like Gecko) Version/5.1.1Safari/534.51.22" 0.172.74.45 - - [02/Feb/2012:06:57:19+0000] "GET 
http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gifHTTP/1.1"302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11(KHTML, like Gecko) Chrome/17.0.963.46Safari/535.11" 0.227.253.171 - - [02/Feb/2012:06:57:19+0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpgHTTP/1.1"200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.172.74.45 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gifHTTP/1.1"302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11(KHTML, like Gecko) Chrome/17.0.963.46Safari/535.11" 0.227.26.32 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.orgHTTP/1.1"200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101Firefox/9.0.1" 0.29.194.93 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.netHTTP/1.1"200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25)Gecko/20111212 AlexaToolbar/alxf-2.13Firefox/3.6.25( .NET CLR 3.5.30729)" 0.90.22.18 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.comHTTP/1.1"200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18)Gecko/20110614Firefox/3.6.18" 0.7.73.16 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02HTTP/1.1"302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15)Gecko/20110303Firefox/3.6.15& 
vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati" 0.49.73.161 - - [02/Feb/2012:06:57:19+0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpgHTTP/1.1"302 0 "-" "Mozilla/5.0" …
  • 4.
    Access Patterns forRobots and Humans in Web Archives Motivation • There have been many studies for web access patterns • This is the first study using Internet Archive’s web server logs to discover how users access web archives 4
  • 5.
    Access Patterns forRobots and Humans in Web Archives Research Question • How do users, both humans and robots, access web archives? 5
  • 6.
    Access Patterns forRobots and Humans in Web Archives Methodology 6
  • 7.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 7 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.aura.vu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"}
  • 8.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 8 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86
  • 9.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 9 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 IPs had been anonymized by Internet Archive
  • 10.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 10 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000
  • 11.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 11 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET
  • 12.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 12 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com
  • 13.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 13 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com TimeMap
  • 14.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 14 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/20130318135600/http://www.cnn.com 0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com/ HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"} Memento TimeMap
  • 15.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 15 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1
  • 16.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 16 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200
  • 17.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 17 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433
  • 18.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 18 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433 • Referring URI: http://www.archive.org/web/web.php
  • 19.
    Access Patterns forRobots and Humans in Web Archives Sample of Wayback Machine access logs 19 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433 • Referring URI: http://www.archive.org/web/web.php • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
  • 20.
    Access Patterns forRobots and Humans in Web Archives Dataset • More than 82 million requests per day come to the Wayback Machine • Cluster Sampling: a week, Feb. 2-8, 2012 • Random Sampling: random slice (2 million requests) from each day of the week • We looked at all these days and found that 2 Feb. is a representative sample – For details, look at Section 4.2 and Table 3 in the paper 20
  • 21.
    Access Patterns forRobots and Humans in Web Archives Pre-processing • Data Cleaning • Session Identification • Robot Detection 21
  • 22.
    Access Patterns forRobots and Humans in Web Archives Data Cleaning 22 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20070519015308/ http://www.jcdl.org/
  • 23.
    Access Patterns forRobots and Humans in Web Archives Embedded Resources 23 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 24.
    Access Patterns forRobots and Humans in Web Archives Embedded Resources 24 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 25.
    Access Patterns forRobots and Humans in Web Archives Static Resources 25 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 26.
    Access Patterns forRobots and Humans in Web Archives Static Resources 26 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 27.
    Access Patterns forRobots and Humans in Web Archives Invalid requests 27 http://web.archive.org/web/20100102003557/ about:blank 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 28.
    Access Patterns forRobots and Humans in Web Archives Invalid requests 28 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20100102003557/ about:blank
  • 29.
    Access Patterns forRobots and Humans in Web Archives Requests that had 3xx status code 29 http://web.archive.org/web/20130114160045/ http://www.jcdl.org/0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 30.
    Access Patterns forRobots and Humans in Web Archives Requests that had 3xx status code 30 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/ curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/" HTTP/1.1 302 Moved Temporarily Server: Tengine/1.4.3 Date: Tue, 02 Jul 2013 19:48:59 GMT Content-Type: application/octet-stream Content-Length: 0 Connection: keep-alive set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT; Location: /web/20130114160045/http://www.jcdl.org/
  • 31.
    Access Patterns forRobots and Humans in Web Archives Requests that had 3xx status code 31 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/
  • 32.
    Access Patterns forRobots and Humans in Web Archives Session: set of web pages requested by a particular user 32 1 mins 4 mins 3 mins 9 mins p1 p2 p3 p4 p5
  • 33.
    Access Patterns forRobots and Humans in Web Archives Session: set of web pages requested by a particular user 33 1 mins 4 mins 3 mins 9 mins p1 p2 p3 p4 p5 Time between two requests ≤ 10
  • 34.
    Access Patterns forRobots and Humans in Web Archives Session Identification • Grouping: based on the IP and User- Agent • Threshold timeout: 10 minutes Liu et al. 2007, Spiliopoulou et al. 2003 34
  • 35.
    Robot Detection is a big challenge
    [Image caption: "I'm not a robot"]
  • 36.
    Distinguishing Robots from Humans
  • 37.
    User-Agent Check
    0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/ HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
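A User-Agent check of this kind could look like the sketch below. The marker list is hypothetical and deliberately short; only `Python-urllib` and `MJ12bot` come from log lines shown in these slides, and a real deployment would need a maintained list.

```python
# Hypothetical substrings that mark self-identified robots or HTTP libraries.
# "python-urllib" and "mj12bot" appear in the slides' log excerpts; the rest
# are generic guesses added for illustration.
ROBOT_MARKERS = ("python-urllib", "mj12bot", "bot", "crawler", "spider")


def is_declared_robot(user_agent):
    """Return True if the User-Agent string contains a known robot marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)
```

This only catches robots that identify themselves honestly, which is why the deck goes on to combine it with other heuristics.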
  • 38.
    Number of User-Agents per IP
  • 39.
    Number of User-Agents per IP
    One IP with ≥ 20 User-Agents = lying robot
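The "20 or more User-Agents from one IP" rule can be sketched as a simple count of distinct strings per address. The function name and the `(ip, user_agent)` record layout are assumptions for the example; only the threshold of 20 comes from the slide.

```python
from collections import defaultdict

MAX_AGENTS_PER_IP = 20  # threshold stated on the slide


def ips_with_many_agents(records, threshold=MAX_AGENTS_PER_IP):
    """Return the set of IPs that used at least `threshold` distinct User-Agents."""
    agents = defaultdict(set)
    for ip, agent in records:
        agents[ip].add(agent)
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}
```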
  • 40.
    Robots.txt file
    • A session that contains a request for robots.txt is a robot session
    0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
    0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET http://wayback.archive.org/web/*/http://www.devilscafe.in HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
    0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET http://wayback.archive.org/web/*/http://www.genie.co.il HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
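A sketch of the robots.txt rule, assuming a session is available as a list of requested URLs. The check looks only at the archive's own `/robots.txt`, as in the log excerpt; treating archived copies of other sites' robots.txt files differently is a choice made for this example, not something the slide specifies.

```python
from urllib.parse import urlsplit


def requests_robots_txt(session_urls):
    """Return True if any request in the session targets the server's /robots.txt."""
    return any(urlsplit(url).path == "/robots.txt" for url in session_urls)
```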
  • 41.
    6 requests, 2 seconds → robot
    0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 42.
    3 requests, 520 seconds (9 minutes) → human
    0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200 566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 43.
    0.5 is a Good Browsing Speed Threshold for Distinguishing Robots and Humans (Nithya et al. 2012, Reddy et al. 2012)
    Browsing Speed (BS)
    \[ BS = \frac{\text{session length}}{\text{session duration}} \]
    \[ BS \le 0.5 \Rightarrow \text{Humans}, \qquad BS > 0.5 \Rightarrow \text{Robots} \]
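The browsing-speed rule can be sketched as follows. The slide does not state the units, so this example assumes session length is the number of requests and session duration is measured in seconds, which is consistent with the "6 requests, 2 seconds" and "3 requests, 520 seconds" examples. Rejecting zero-length durations is also a choice made here, since the slide does not say how single-request sessions are handled.

```python
BS_THRESHOLD = 0.5  # threshold stated on the slide


def browsing_speed(num_requests, duration_seconds):
    """Requests per second over a session (assumed units, see lead-in)."""
    if duration_seconds <= 0:
        raise ValueError("browsing speed is undefined for a zero-length session")
    return num_requests / duration_seconds


def is_robot_by_speed(num_requests, duration_seconds, threshold=BS_THRESHOLD):
    """Return True when the session's browsing speed exceeds the threshold."""
    return browsing_speed(num_requests, duration_seconds) > threshold
```

Applied to the two example sessions, 6 requests in 2 seconds gives a speed of 3 and is flagged as a robot, while 3 requests in 520 seconds stays far below 0.5.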
  • 44.
    Image-to-HTML Ratio
    [Image caption: "If I download these, I'm not a robot"]
  • 45.
    Image-to-HTML Ratio
    • The ratio between the number of image files and the number of HTML files per session
    • Robot sessions have an image-to-HTML ratio below 1:10, as suggested by Stassopoulou et al. 2005
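A minimal sketch of the ratio test, assuming the image and HTML request counts for a session have already been tallied. How a session with no HTML requests should be treated is not stated on the slide; returning False in that case is an assumption of this example.

```python
RATIO_THRESHOLD = 0.1  # the 1:10 image-to-HTML ratio cited on the slide


def is_robot_by_image_ratio(num_images, num_html, threshold=RATIO_THRESHOLD):
    """Return True when images per HTML page fall below the threshold."""
    if num_html == 0:
        # Assumption: without HTML requests the ratio is undefined, so do not flag.
        return False
    return num_images / num_html < threshold
```

For example, a session with 20 HTML pages and 1 image (ratio 0.05) is flagged, while one with 3 HTML pages and 12 images is not.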
  • 46.
    Image-to-HTML ratio is the best at detecting robots
  • 47.
    Traffic Analysis
    • Records remaining after cleaning: 21.3% (426,317 out of 2M)
    • Unique IPs: 21,932
    • Users: 33,841
    • Sessions: 37,634
  • 48.
    Robots have longer sessions than humans
  • 49.
    Humans spend more time than robots
  • 50.
    Robots outnumber humans in terms of:
    • Sessions: 10:1
    • Raw HTTP accesses: 5:4
    • MB transferred: 4:1
  • 51.
    User Access Patterns in Web Archives
    • Dip
    • Dive
    • Slide
    • Skim
  • 52.
    Dip: simple access to TimeMap or memento
    [Figure labels: TimeMap, Memento]
  • 53.
    Dive: different pages at approximately the same archive time
    [Figure: three mementos dated November 12, 2009 11:55:54; November 12, 2009 05:37:22; November 12, 2009 05:38:02]
  • 54.
    Slide: the same page at different archive times
    [Figure: three mementos dated March 18, 2013 13:56:00; November 15, 2009 05:33:01; July 31, 2006 23:55:45]
  • 55.
    Skim: lists of TimeMaps
    http://web.archive.org/web/*/http://cnn.com/
    http://web.archive.org/web/*/http://www.bbc.com/
    http://web.archive.org/web/*/http://www.nytimes.com/
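The Dive, Slide, and Skim definitions can be illustrated by labelling the move between two consecutive Wayback Machine requests. This is a simplified sketch, not the authors' detection procedure: the names `parse_wayback_url` and `classify_step` are made up for the example, a URL whose timestamp part is or ends with `*` is treated as a TimeMap, and the "approximately the same archive time" condition for Dive is not checked. Dip is not covered because it describes a single access rather than a transition.

```python
import re
from collections import namedtuple

WaybackRequest = namedtuple("WaybackRequest", "is_timemap stamp target")

# /web/<timestamp or *><optional two-letter flag such as im_>/<original URI>
WAYBACK_RE = re.compile(
    r"^https?://[^/]+/web/(?P<stamp>\d{1,14}\*?|\*)(?P<flag>[a-z]{2}_)?/(?P<target>.+)$"
)


def parse_wayback_url(url):
    """Split a Wayback URL into (is_timemap, stamp, target), or None if it does not match."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    stamp = match.group("stamp")
    return WaybackRequest(stamp.endswith("*"), stamp, match.group("target"))


def classify_step(prev_url, next_url):
    """Label the transition between two consecutive requests of one session."""
    prev, nxt = parse_wayback_url(prev_url), parse_wayback_url(next_url)
    if prev is None or nxt is None:
        return None
    if prev.is_timemap and nxt.is_timemap:
        # TimeMap after TimeMap for a different page: skim.
        return "skim" if prev.target != nxt.target else None
    if not prev.is_timemap and not nxt.is_timemap:
        if prev.target == nxt.target:
            # Same page, different archive time: slide.
            return "slide" if prev.stamp != nxt.stamp else None
        # Different pages (archive-time proximity is not verified here): dive.
        return "dive"
    return None
```

Under these assumptions, the three TimeMap URLs listed on this slide produce two consecutive "skim" steps.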
  • 56.
    Everybody Dips, Humans Dive, Robots Skim
    [Figure panels: Robots (34,203 sessions); Humans (3,431 sessions)]
  • 57.
    Pattern Length
    [Figure examples: Slide length = 4; Skim length = 3]
  • 58.
    Small Medians, Large Standard Deviations
  • 59.
    Only recent past exhibits locality of reference
  • 60.
    Only recent past exhibits locality of reference
    Cache replacement policies should favor recent past
  • 61.
    Conclusions
    • We introduced traffic analysis for the Wayback Machine
    • We discovered that robots outnumber humans
      – 10:1 in terms of sessions
      – 5:4 in terms of raw, unfiltered requests
      – 4:1 in terms of megabytes transferred
      – Robots need APIs
    http://arxiv.org/abs/1305.5959
    • We identified four major web archive access patterns
      – Dip
      – Slide
      – Dive
      – Skim
    • Only recent past exhibits locality of reference
  • 62.
    Extra Slides
  • 63.
    The Features of the Samples
    Days       Feb 2    Feb 3    Feb 4    Feb 5    Feb 6    Feb 7    Feb 8    Mean     SD       SE
    Duration   0:33:12  0:31:15  0:40:34  0:42:57  0:29:35  0:25:45  0:24:33  0:32:33  0:06:29  0:02:27
    GET        98.4%    99.3%    97.7%    97.9%    99.4%    99.7%    99.8%    99%      0.8%     0.3%
    Embedded   47.4%    34.8%    43.7%    42.7%    41.9%    44.7%    46.8%    43.1%    3.9%     1.5%
    SI Robots  6.2%     12.0%    7.7%     7.7%     2.9%     3.5%     3.8%     6.3%     3.0%     1.1%
    NullRef    42.6%    56.6%    47.5%    47.0%    49.4%    42.6%    43.9%    47.1%    4.6%     1.7%
    s2xx       33.7%    32.4%    34.2%    33.2%    34.1%    33.4%    33.6%    33.5%    0.6%     0.2%
    s3xx       51.8%    52.3%    50.8%    52.2%    51.7%    51.9%    53.2%    52.0%    0.7%     0.3%
    s4xx       11.7%    13.1%    12.0%    11.6%    11.2%    10.3%    10.1%    11.4%    0.9%     0.4%
    s5xx       2.8%     2.3%     3.0%     2.9%     3.0%     4.4%     3.1%     3.1%     0.6%     0.2%
    Cleaned    21.3%    23.0%    17.6%    17.7%    20.7%    18.1%    16.9%    19.3%    2.2%     0.8%
    Sessions   37,634   31,731   32,159   28,750   36,087   35,848   32,117   33,475   2,896    1,094
  • 64.
    Very Small Standard Errors among Samples
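The Mean, SD, and SE columns can be checked against the daily values. The sketch below recomputes them for the Sessions row of the samples table, taking SE as SD divided by the square root of the number of days. Using the population form of the standard deviation (`statistics.pstdev`) is an inference from the numbers, since that is what lands within rounding of the tabulated 2,896; the slides do not state which form was used.

```python
import math
import statistics

# Sessions per day, Feb 2 to Feb 8, copied from the samples table.
sessions_per_day = [37634, 31731, 32159, 28750, 36087, 35848, 32117]


def summarize(values):
    """Return (mean, population SD, standard error of the mean)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD: an inferred choice, see lead-in
    se = sd / math.sqrt(len(values))
    return mean, sd, se


mean, sd, se = summarize(sessions_per_day)
print(f"mean={mean:.0f} sd={sd:.0f} se={se:.0f}")
```

The same check can be repeated for any other row by swapping in its seven daily values.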
  • 65.
    Feb. 2, 2012 sample is representative
  • 66.
    Results of Data Cleaning
    • The records remaining after cleaning are 21.3% of the requests in the raw file.
  • 67.
    Robots outnumber humans in terms of:
    • Sessions: 10:1
    • Raw HTTP accesses: 5:4
    • MB transferred: 4:1
    Users    # Sessions      # Requests (Raw)   # Transferred MB
    Robots   34,203 (90.9%)  1,002,573 (50.1%)  20,010
    Humans   3,431 (9.10%)   810,049 (40.5%)    4,459
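The headline ratios follow from the table on this slide by dividing each robot figure by the corresponding human figure. The dictionary keys and the function name below are made up for the example; the numbers are copied from the table.

```python
# Figures copied from the table on this slide.
robots = {"sessions": 34203, "requests": 1002573, "megabytes": 20010}
humans = {"sessions": 3431, "requests": 810049, "megabytes": 4459}


def robot_to_human_ratios(robot_counts, human_counts):
    """Return the robot-to-human ratio for every measure in the two dicts."""
    return {key: robot_counts[key] / human_counts[key] for key in robot_counts}


ratios = robot_to_human_ratios(robots, humans)
for measure, value in ratios.items():
    print(f"{measure}: {value:.2f} to 1")
```

The session and request ratios come out close to the slide's 10:1 and 5:4; the megabyte ratio lands between 4 and 5, which the slide reports as 4:1.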
  • 68.
    Humans exhibit Dip and Dive, while robots exhibit Dip and Skim
    [Figure panels: Robots, Humans; labels: 328 Slides, 571 Dives, 1167 Slides, 1942 Dives]
  • 69.
    The total number of mementos available for 2011 was similar to previous years.