Access Patterns for Robots
and Humans in Web Archives
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Computer Science Department
Old Dominion University, Norfolk, VA
yasmin@cs.odu.edu
0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
…
Motivation
• Many studies have examined web access patterns
• This is the first study to use the Internet Archive’s web server logs to discover how users access web archives
Research Question
• How do users, both humans and robots,
access web archives?
Methodology
Sample of Wayback Machine
access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1"
200 18875 "http://wayback.archive.org/web/*/http://www.aura.vu"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
Note: client IPs had been anonymized by the Internet Archive
Sample of Wayback Machine access logs
TimeMap request (the URI has * in place of a timestamp):
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
Memento request (the URI has a 14-digit timestamp):
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com/
HTTP/1.1" 200 18875
"http://wayback.archive.org/web/*/http://www.cnn.com"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
Sample of Wayback Machine
access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
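The field breakdown above can be reproduced with a small parser. The following is only a sketch, assuming the log lines follow the layout of the sample entry; the regular expression, the function name `parse_log_line`, and the dictionary keys are illustrative choices, not part of the original study.

```python
import re
from datetime import datetime

# Assumed layout: IP, two unused fields, [time], "METHOD URI PROTOCOL",
# status, bytes, "referrer", "user-agent" (as in the sample entry).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

SAMPLE = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 '
    '"http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)
record = parse_log_line(SAMPLE)
```

Lines that deviate from this layout (for example, entries without the two `-` fields) return None and would need a looser pattern.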
Dataset
• More than 82 million requests per day come
to the Wayback Machine
• Cluster Sampling: a week, Feb. 2-8, 2012
• Random Sampling: random slice (2 million
requests) from each day of the week
• We examined all seven days and found that Feb. 2 is a representative sample
– For details, see Section 4.2 and Table 3 in the paper
Pre-processing
• Data Cleaning
• Session Identification
• Robot Detection
Data Cleaning
0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
http://web.archive.org/web/20070519015308/http://www.jcdl.org/
Embedded Resources
Page: http://web.archive.org/web/20070519015308/http://www.jcdl.org/
Embedded resource request:
0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
Static Resources
Page: http://web.archive.org/web/20070519015308/http://www.jcdl.org/
Static resource request:
0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
Invalid requests
http://web.archive.org/web/20100102003557/about:blank
0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
Requests that had 3xx status code
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
Redirects to: http://web.archive.org/web/20130114160045/http://www.jcdl.org/
curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
HTTP/1.1 302 Moved Temporarily
Server: Tengine/1.4.3
Date: Tue, 02 Jul 2013 19:48:59 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive
set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
Location: /web/20130114160045/http://www.jcdl.org/
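The four cleaning categories in this part of the talk (embedded resources, static resources, invalid requests, and 3xx responses) could be filtered roughly as follows. This is a sketch under stated assumptions: each category is simply dropped, embedded resources are recognized by the `im_` modifier seen in the sample plus the `js_` and `cs_` modifiers, static resources by the `staticweb.archive.org` host, and invalid requests only by an `about:blank` target. The function name `keep_request` is hypothetical.

```python
import re

# Assumption: a 14-digit timestamp followed by im_, js_ or cs_ marks an
# embedded resource (image, script, style sheet) of an archived page.
EMBEDDED = re.compile(r"/web/\d{14}(?:im|js|cs)_/")

def keep_request(uri, status):
    """Return True if the request survives the cleaning step."""
    if EMBEDDED.search(uri):
        return False  # embedded resource
    if uri.startswith("http://staticweb.archive.org/"):
        return False  # static resource served by the archive itself
    if uri.endswith("/about:blank"):
        return False  # invalid request (only this case is handled here)
    if 300 <= status < 400:
        return False  # redirect
    return True

# (URI, status) pairs taken from the five sample entries.
SAMPLE = [
    ("http://web.archive.org/web/20070519015308/http://www.jcdl.org/", 200),
    ("http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg", 200),
    ("http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png", 200),
    ("http://web.archive.org/web/20100102003557/about:blank", 302),
    ("http://web.archive.org/web/20140004100000/http://www.jcdl.org/", 302),
]
cleaned = [uri for uri, status in SAMPLE if keep_request(uri, status)]
```

Under these assumptions, only the first sample entry (the archived page itself) remains after cleaning.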
Session: set of web pages requested
by a particular user
[Figure: timeline of requests p1 to p5 separated by gaps of 1, 3, 4, and 9 minutes]
Time between two requests ≤ 10 minutes
Session Identification
• Grouping: based on the IP address and User-Agent
• Threshold timeout: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
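The grouping and timeout rules above can be sketched as follows. This is a minimal illustration, assuming each request has already been reduced to an (IP, User-Agent, timestamp) tuple; the function name `identify_sessions` and the example data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold from the slide

def identify_sessions(requests):
    """Split (ip, user_agent, time) tuples into sessions.

    A user is an (ip, user_agent) pair; a gap longer than TIMEOUT
    between two consecutive requests of a user starts a new session.
    """
    by_user = defaultdict(list)
    for request in requests:
        by_user[(request[0], request[1])].append(request)

    sessions = []
    for user_requests in by_user.values():
        user_requests.sort(key=lambda r: r[2])
        current = [user_requests[0]]
        for request in user_requests[1:]:
            if request[2] - current[-1][2] > TIMEOUT:
                sessions.append(current)
                current = [request]
            else:
                current.append(request)
        sessions.append(current)
    return sessions

# Hypothetical user: gaps of 1, 4, 3, 9 and 11 minutes between requests.
base = datetime(2012, 2, 2, 7, 0, 0)
example = [("0.11.160.13", "Mozilla/5.0", base + timedelta(minutes=m))
           for m in (0, 1, 5, 8, 17, 28)]
sessions = identify_sessions(example)
```

In this example the first five requests fall into one session, and the 11-minute gap before the last request opens a second one.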
Robot Detection is a big challenge
[Image caption: "I’m not a robot"]
Distinguishing Robots from
Humans
User-Agent Check
0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET
http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/
HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
Number of User-Agents per IP
One IP with ≥ 20 User-Agents = lying robot
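The two User-Agent heuristics (a self-declared robot User-Agent, and one IP presenting 20 or more User-Agents) might be sketched like this. The keyword list is a hypothetical stand-in for whatever list of known robot agents is actually used, and both function names are illustrative.

```python
from collections import defaultdict

# Hypothetical keyword list; a real study would use a curated list.
ROBOT_UA_KEYWORDS = ("bot", "crawler", "spider", "python-urllib")
MAX_USER_AGENTS_PER_IP = 20  # threshold from the slide

def declares_robot(user_agent):
    """True if the User-Agent string contains a known robot keyword."""
    lowered = user_agent.lower()
    return any(keyword in lowered for keyword in ROBOT_UA_KEYWORDS)

def ips_with_many_user_agents(requests, threshold=MAX_USER_AGENTS_PER_IP):
    """Return the IPs seen with at least `threshold` distinct User-Agents.

    `requests` is an iterable of (ip, user_agent) pairs.
    """
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}
```

For example, `declares_robot("Python-urllib/1.17")` is true for the sample entry above, while an ordinary Firefox User-Agent string is not matched by this keyword list.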
Robots.txt file
• A session that contains a request for robots.txt is a robot session
0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET
http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.1;
http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.devilscafe.in
HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.genie.co.il
HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
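A possible check for the robots.txt rule is sketched below, assuming a session is available as a list of requested URIs. The function name `requests_robots_txt` is hypothetical. The check compares the path of the request to the archive host, so a request for an archived copy of some site's robots.txt is not counted.

```python
from urllib.parse import urlsplit

def requests_robots_txt(session_uris):
    """True if any request in the session asks the archive for /robots.txt."""
    return any(urlsplit(uri).path == "/robots.txt" for uri in session_uris)

# URIs of the three sample entries above.
session = [
    "http://web.archive.org/robots.txt",
    "http://wayback.archive.org/web/*/http://www.devilscafe.in",
    "http://wayback.archive.org/web/*/http://www.genie.co.il",
]
is_robot = requests_robots_txt(session)
```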
6 requests, 2 seconds → robot
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET
http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
3 requests, 520 seconds (9 minutes) → human
0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)"
0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200
566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_6_8)"
0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433
"http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8)"
0.5 is a Good Browsing Speed Threshold
for Distinguishing Robots and Humans (Nithya
et al. 2012, Reddy et al. 2012)
Browsing Speed (BS)
\[ BS = \frac{\text{session length}}{\text{session duration}} \]
\[ BS \le 0.5 \Rightarrow \text{Humans}, \qquad BS > 0.5 \Rightarrow \text{Robots} \]
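The browsing-speed rule can be sketched as below. It assumes session length is counted in requests and session duration in seconds, which matches the two examples in the talk (6 requests in 2 seconds, 3 requests in 520 seconds). Flooring the duration at one second for very short sessions is an additional assumption of this sketch, and the function names are hypothetical.

```python
BS_THRESHOLD = 0.5  # threshold from the slide

def browsing_speed(n_requests, duration_seconds):
    """Requests per second; durations under 1 s are floored at 1 s (assumption)."""
    return n_requests / max(duration_seconds, 1.0)

def classify_by_browsing_speed(n_requests, duration_seconds):
    """Label a session 'robot' if BS > 0.5, otherwise 'human'."""
    if browsing_speed(n_requests, duration_seconds) > BS_THRESHOLD:
        return "robot"
    return "human"

fast = classify_by_browsing_speed(6, 2)    # BS = 3.0
slow = classify_by_browsing_speed(3, 520)  # BS is roughly 0.006
```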
Image-to-HTML Ratio
[Image caption: "If I download these, I’m not a robot"]
• The ratio between the number of image files
and the number of HTML files per session
• Sessions with an image-to-HTML ratio below 1:10 are classified as robot
sessions, as suggested by Stassopoulou et al. 2005
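A minimal sketch of the ratio test follows, assuming every request in a session has already been labeled with a content kind such as "image" or "html". The labels and the function name are hypothetical, and how the kind is derived from a log entry is left open. Sessions with no HTML requests are not flagged here, which is a choice of this sketch rather than something stated in the talk.

```python
from collections import Counter

def looks_like_robot_by_image_ratio(kinds):
    """True if the session's image-to-HTML ratio is below 1:10.

    `kinds` is an iterable of content labels, one per request.
    """
    counts = Counter(kinds)
    images, html = counts["image"], counts["html"]
    # images / html < 1 / 10, written without division
    return html > 0 and images * 10 < html

crawler_like = looks_like_robot_by_image_ratio(["html"] * 20 + ["image"])
browser_like = looks_like_robot_by_image_ratio(["html"] * 3 + ["image"] * 6)
```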
Image-to-HTML ratio is the best heuristic
for detecting robots
Traffic Analysis
• Records remaining after cleaning: 21.3%
(426,317 out of 2M)
• Unique IPs: 21,932
• Users: 33,841
• Sessions: 37,634
Robots have longer sessions
than humans
Humans spend more time
than Robots
Robots outnumber humans
in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
User Access Patterns in
Web Archives
• Dip
• Dive
• Slide
• Skim
Dip: simple access to
TimeMap or memento
[Figure: a TimeMap and a Memento]
Dive: different pages at approximately
the same archive time
[Figure: three different pages archived on November 12, 2009 at 11:55:54, 05:37:22, and 05:38:02]
Slide: the same page at different
archive times
54
March 18, 2013 13:56:00 November 15, 2009 05:33:01 July 31, 2006 23:55:45
Skim: lists of TimeMaps
http://web.archive.org/web/*/http://cnn.com/
http://web.archive.org/web/*/http://www.bbc.com/
http://web.archive.org/web/*/http://www.nytimes.com/
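The four patterns can be illustrated with a rough sketch that labels each step between consecutive requests in a session. Everything here is an assumption for illustration: the regular expression is inferred from the Wayback Machine URLs shown in the log samples, the one-day tolerance for "approximately the same archive time" is arbitrary, the "other" label and all function names are hypothetical, and this is not the authors' detection method.

```python
import re
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional

# web/*/<url> is a TimeMap request; web/<14 digits>[xx_]/<url> is a memento.
WAYBACK_RE = re.compile(
    r"^https?://(?:web|wayback)\.archive\.org/web/(\*|\d{14})(?:[a-z]{2}_)?/(.+)$"
)


class Request(NamedTuple):
    archive_time: Optional[datetime]  # None means a TimeMap request
    original_url: str


def parse_wayback_url(url: str) -> Request:
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a Wayback Machine URL: {url}")
    stamp, original = match.groups()
    when = None if stamp == "*" else datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return Request(when, original.rstrip("/"))


def classify_step(prev: Request, curr: Request,
                  tolerance: timedelta = timedelta(days=1)) -> str:
    """Label the transition between two consecutive requests."""
    if prev.archive_time is None and curr.archive_time is None:
        # Two TimeMaps for different pages in a row.
        return "skim" if prev.original_url != curr.original_url else "other"
    if prev.archive_time is None or curr.archive_time is None:
        return "other"
    same_page = prev.original_url == curr.original_url
    if same_page and prev.archive_time != curr.archive_time:
        return "slide"
    if not same_page and abs(prev.archive_time - curr.archive_time) <= tolerance:
        return "dive"
    return "other"


def classify_session(urls: List[str]) -> List[str]:
    """A single-request session is a dip; otherwise label each step."""
    requests = [parse_wayback_url(u) for u in urls]
    if len(requests) == 1:
        return ["dip"]
    return [classify_step(a, b) for a, b in zip(requests, requests[1:])]


print(classify_session([
    "http://web.archive.org/web/*/http://cnn.com/",
    "http://web.archive.org/web/*/http://www.bbc.com/",
    "http://web.archive.org/web/*/http://www.nytimes.com/",
]))
```

Labeling steps rather than whole sessions keeps the sketch simple; how runs of steps are grouped into a pattern of a given length is left open here.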
Everybody Dips, Humans Dive, Robots Skim
Figure panels: Robots (34,203 sessions), Humans (3,431 sessions)
Pattern Length
Slide length = 4
Skim length = 3
Small Medians, Large Standard Deviations
Only recent past exhibits locality of reference
Cache replacement policies should favor recent past
Conclusions
• We introduced traffic analysis for the Wayback Machine
• We discovered that robots outnumber humans
– 10:1 in terms of sessions
– 5:4 in terms of raw, unfiltered requests
– 4:1 in terms of megabytes transferred
– Robots need APIs http://arxiv.org/abs/1305.5959
• We identified four major web archive access patterns
– Dip
– Slide
– Dive
– Skim
• Only recent past exhibits locality of reference
Extra Slides
The Features of the Samples
Days       Feb 2    Feb 3    Feb 4    Feb 5    Feb 6    Feb 7    Feb 8    Mean     SD       SE
Duration   0:33:12  0:31:15  0:40:34  0:42:57  0:29:35  0:25:45  0:24:33  0:32:33  0:06:29  0:02:27
GET        98.4%    99.3%    97.7%    97.9%    99.4%    99.7%    99.8%    99%      0.8%     0.3%
Embedded   47.4%    34.8%    43.7%    42.7%    41.9%    44.7%    46.8%    43.1%    3.9%     1.5%
SI Robots  6.2%     12.0%    7.7%     7.7%     2.9%     3.5%     3.8%     6.3%     3.0%     1.1%
NullRef    42.6%    56.6%    47.5%    47.0%    49.4%    42.6%    43.9%    47.1%    4.6%     1.7%
s2xx       33.7%    32.4%    34.2%    33.2%    34.1%    33.4%    33.6%    33.5%    0.6%     0.2%
s3xx       51.8%    52.3%    50.8%    52.2%    51.7%    51.9%    53.2%    52.0%    0.7%     0.3%
s4xx       11.7%    13.1%    12.0%    11.6%    11.2%    10.3%    10.1%    11.4%    0.9%     0.4%
s5xx       2.8%     2.3%     3.0%     2.9%     3.0%     4.4%     3.1%     3.1%     0.6%     0.2%
Cleaned    21.3%    23.0%    17.6%    17.7%    20.7%    18.1%    16.9%    19.3%    2.2%     0.8%
Sessions   37,634   31,731   32,159   28,750   36,087   35,848   32,117   33,475   2,896    1,094
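As a worked check of the Mean, SD, and SE columns, the Sessions row can be recomputed from its seven daily values. The slides do not say which formulas were used; the sketch below assumes a population standard deviation (dividing by n) and SE = SD / sqrt(n), which happens to reproduce the reported figures for this row.

```python
import math
import statistics

# Sessions per day, Feb 2 through Feb 8, copied from the table.
sessions = [37634, 31731, 32159, 28750, 36087, 35848, 32117]

mean = statistics.mean(sessions)
sd = statistics.pstdev(sessions)      # population SD (assumption, see lead-in)
se = sd / math.sqrt(len(sessions))    # standard error of the mean

print(round(mean), round(sd), round(se))  # → 33475 2896 1094
```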
Very Small Standard Errors among Samples
Feb. 2, 2012 sample is representative
Results of Data Cleaning
• The records remaining after cleaning are 21.3% of the requests in the raw file.
Robots outnumber humans in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
Users    # Sessions      # Requests (Raw)   # Transferred MB
Robots   34,203 (90.9%)  1,002,573 (50.1%)  20,010
Humans   3,431 (9.1%)    810,049 (40.5%)    4,459
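The headline ratios can be checked against the counts in the table with a few lines of arithmetic. The dictionary keys below are hypothetical names chosen for this sketch; the numbers are copied from the table.

```python
robots = {"sessions": 34203, "raw_requests": 1002573, "mb_transferred": 20010}
humans = {"sessions": 3431, "raw_requests": 810049, "mb_transferred": 4459}

# Robots-per-human ratio for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f} robots per human")
```

The session ratio comes out just under 10, the raw request ratio close to 1.25 (5:4), and the transferred-MB ratio between 4 and 5, which the slides round to 4:1.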
Humans exhibit Dip and Dive, while robots exhibit Dip and Skim
Figure: Slide and Dive counts for robots and humans (labels as extracted: 328 Slides, 571 Dives; 1167 Slides, 1942 Dives)
The total number of mementos available for 2011 was similar to previous years.


  • 1. Access Patterns for Robots and Humans in Web Archives. Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson. Computer Science Department, Old Dominion University, Norfolk, VA. yasmin@cs.odu.edu
  • 2. Raw Wayback Machine access log entries:
      0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
      0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
      0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
      0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
      0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
      0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
      0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
      0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
      0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
      0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
      0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
      0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      …
  • 4. Motivation
      • There have been many studies of web access patterns
      • This is the first study to use the Internet Archive's web server logs to discover how users access web archives
  • 5. Research Question
      • How do users, both humans and robots, access web archives?
  • 6. Methodology
  • 7. Sample of Wayback Machine access logs
      0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.aura.vu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
  • 8-19. Sample of Wayback Machine access logs (fields of one log entry)
      0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      • Client IP: 0.247.222.86 (IPs had been anonymized by the Internet Archive)
      • Access time: 02/Feb/2012:07:03:46 +0000
      • HTTP request method: GET
      • URI: http://wayback.archive.org/web/*/http://www.cnn.com (a TimeMap)
      • Protocol: HTTP/1.1
      • HTTP status code: 200
      • Bytes sent: 96433
      • Referring URI: http://www.archive.org/web/web.php
      • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
      The next request from the same client retrieves a memento, with the TimeMap as its referring URI:
      0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com/ HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      • URI: http://web.archive.org/web/20130318135600/http://www.cnn.com/ (a memento)
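The field breakdown above can be sketched as a small parser. This is a minimal illustration, not the authors' tooling: the regular expression `LOG_PATTERN` and the function `parse_log_line` are hypothetical names, and the pattern assumes the combined-log layout seen in the sample entry (IP, two placeholder fields, bracketed time, quoted request, status, bytes, quoted referrer, quoted User-Agent).

```python
import re
from datetime import datetime

# Hypothetical pattern for the combined-log layout shown in the sample entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


SAMPLE = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" '
    '200 96433 "http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)

if __name__ == "__main__":
    for name, value in parse_log_line(SAMPLE).items():
        print(f"{name}: {value}")
```

Lines that deviate from this layout (for example, entries without the two `-` placeholder fields) would return `None` and need a looser pattern.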
  • 20. Dataset
      • More than 82 million requests per day come to the Wayback Machine
      • Cluster sampling: one week, Feb. 2-8, 2012
      • Random sampling: a random slice (2 million requests) from each day of the week
      • We looked at all of these days and found that Feb. 2 is a representative sample (for details, see Section 4.2 and Table 3 in the paper)
  • 21. Pre-processing
      • Data Cleaning
      • Session Identification
      • Robot Detection
  • 22-31. Data Cleaning
      Sample log entries:
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      • Embedded Resources (slides 23-24): http://web.archive.org/web/20070519015308/http://www.jcdl.org/
      • Static Resources (slides 25-26): http://web.archive.org/web/20070519015308/http://www.jcdl.org/
      • Invalid requests (slides 27-28): http://web.archive.org/web/20100102003557/about:blank
      • Requests that had 3xx status code (slides 29-31): http://web.archive.org/web/20130114160045/http://www.jcdl.org/
        curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
        HTTP/1.1 302 Moved Temporarily
        Server: Tengine/1.4.3
        Date: Tue, 02 Jul 2013 19:48:59 GMT
        Content-Type: application/octet-stream
        Content-Length: 0
        Connection: keep-alive
        set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
        Location: /web/20130114160045/http://www.jcdl.org/
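The four cleaning categories shown on these slides could be approximated with a filter like the one below. This is a sketch under stated assumptions, not the authors' implementation: `drop_reason`, `STATIC_HOSTS`, and `INVALID_PREFIXES` are hypothetical names; treating any two-letter Wayback modifier such as `im_` as an embedded resource, `staticweb.archive.org` as the only static host, and `about:` as the only invalid target are assumptions read off the sample entries.

```python
import re
from urllib.parse import urlsplit

# Assumed Wayback path layout: /web/<timestamp or *>[modifier][*]/<target URI>
WAYBACK_PATH = re.compile(
    r"^/web/(?:\d+|\*)(?P<modifier>[a-z]{2}_)?\*?/(?P<target>.*)$"
)
STATIC_HOSTS = {"staticweb.archive.org"}  # assumption: taken from the sample log
INVALID_PREFIXES = ("about:",)            # assumption: only about: targets are invalid


def drop_reason(uri, status):
    """Return why a request would be removed during cleaning, or None to keep it."""
    parts = urlsplit(uri)
    if parts.netloc in STATIC_HOSTS:
        return "static"
    match = WAYBACK_PATH.match(parts.path)
    if match is not None:
        if match.group("modifier"):
            return "embedded"
        if match.group("target").startswith(INVALID_PREFIXES):
            return "invalid"
    if 300 <= status < 400:
        return "redirect"
    return None


if __name__ == "__main__":
    samples = [
        ("http://web.archive.org/web/20070519015308/http://www.jcdl.org/", 200),
        ("http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg", 200),
        ("http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png", 200),
        ("http://web.archive.org/web/20100102003557/about:blank", 302),
        ("http://web.archive.org/web/20140004100000/http://www.jcdl.org/", 302),
    ]
    for uri, status in samples:
        print(drop_reason(uri, status), uri)
```

Applied to the five sample entries, only the first request (the archived jcdl.org page itself) is kept.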
  • 32-33. Session: set of web pages requested by a particular user
      p1 → (1 min) → p2 → (4 min) → p3 → (3 min) → p4 → (9 min) → p5
      Time between two requests ≤ 10 minutes
  • 34. Session Identification
      • Grouping: based on the IP and User-Agent
      • Threshold timeout: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
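The session rule above (group by IP and User-Agent, split when the gap between consecutive requests exceeds 10 minutes) can be sketched as follows. The function name `identify_sessions` and the tuple layout of the input are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold stated on the slide


def identify_sessions(requests, timeout=TIMEOUT):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    Returns a list of ((ip, user_agent), [timestamps]) pairs. A new session
    starts whenever the gap to the previous request exceeds the timeout.
    """
    by_user = defaultdict(list)
    for ip, user_agent, timestamp in requests:
        by_user[(ip, user_agent)].append(timestamp)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] <= timeout:
                current.append(timestamp)
            else:
                sessions.append((user, current))
                current = [timestamp]
        sessions.append((user, current))
    return sessions


if __name__ == "__main__":
    start = datetime(2012, 2, 2, 7, 0, 0)
    # Gaps of 1, 4, 3 and 9 minutes, as in the p1..p5 example.
    offsets = [0, 1, 5, 8, 17]
    reqs = [("0.0.0.1", "UA", start + timedelta(minutes=m)) for m in offsets]
    print(len(identify_sessions(reqs)))  # all five pages fall in one session
```

With the gaps from the p1..p5 example, every gap is at most 10 minutes, so the five requests form a single session.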
  • 35. Robot Detection is a big challenge ("I'm not a robot")
  • 36. Distinguishing Robots from Humans
  • 37. User-Agent Check
      0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/ HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
  • 38-39. Number of User-Agents per IP
      • One IP with ≥ 20 User-Agents = lying robot
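The User-Agents-per-IP heuristic can be sketched as a simple count of distinct User-Agent strings per client IP. The function name `ips_with_many_user_agents` and the input layout are assumptions; the threshold of 20 comes from the slide.

```python
from collections import defaultdict

UA_THRESHOLD = 20  # threshold stated on the slide


def ips_with_many_user_agents(requests, threshold=UA_THRESHOLD):
    """Return the set of IPs that sent at least `threshold` distinct User-Agents.

    `requests` is an iterable of (ip, user_agent) pairs.
    """
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}


if __name__ == "__main__":
    # Synthetic data: one IP cycling through 20 agents, one IP with two.
    reqs = [("0.0.0.1", f"agent-{i}") for i in range(20)]
    reqs += [("0.0.0.2", "Mozilla/5.0"), ("0.0.0.2", "Mozilla/4.0")]
    print(ips_with_many_user_agents(reqs))
```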
  • 40. Robots.txt file
      • A session that contains a request for robots.txt is a robot
      0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET http://wayback.archive.org/web/*/http://www.devilscafe.in HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET http://wayback.archive.org/web/*/http://www.genie.co.il HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
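A minimal sketch of the robots.txt check, assuming a session is available as a list of requested URIs. The function name `requests_robots_txt` is hypothetical, and matching only the exact path `/robots.txt` is an assumption.

```python
from urllib.parse import urlsplit


def requests_robots_txt(session_uris):
    """Return True if any request in the session asks for /robots.txt."""
    return any(urlsplit(uri).path == "/robots.txt" for uri in session_uris)


if __name__ == "__main__":
    session = [
        "http://web.archive.org/robots.txt",
        "http://wayback.archive.org/web/*/http://www.devilscafe.in",
        "http://wayback.archive.org/web/*/http://www.genie.co.il",
    ]
    print(requests_robots_txt(session))
```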
  • 41. 6 requests, 2 seconds → robot
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 42. 3 requests, 520 seconds (9 minutes) → human
      0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200 566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 43. 0.5 is a Good Browsing Speed Threshold for Distinguishing Robots and Humans (Nithya et al. 2012, Reddy et al. 2012) • Browsing Speed (BS): $BS = \frac{\text{session length}}{\text{session duration}}$ • $BS \le 0.5$: humans; $BS > 0.5$: robots
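As a worked example, the browsing speed rule can be applied to the two sample sessions from the preceding slides (6 requests in 2 seconds, 3 requests in 520 seconds). This sketch assumes that session length means the number of requests and session duration is measured in seconds, as those examples suggest. The function names, and the handling of zero-duration sessions, are choices made here and are not taken from the slides.

```python
def browsing_speed(num_requests, duration_seconds):
    """Requests per second over a session (BS = length / duration)."""
    if duration_seconds <= 0:
        # Assumption: a zero-duration session is treated as infinitely fast.
        return float("inf")
    return num_requests / duration_seconds


def classify_by_browsing_speed(num_requests, duration_seconds, threshold=0.5):
    """Label a session 'robot' if BS > threshold, otherwise 'human'."""
    speed = browsing_speed(num_requests, duration_seconds)
    return "robot" if speed > threshold else "human"


fast_session = classify_by_browsing_speed(6, 2)     # BS = 3.0
slow_session = classify_by_browsing_speed(3, 520)   # BS is about 0.006
```

With the 0.5 threshold, the first session is labeled a robot and the second a human, matching the labels on the slides.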
  • 44. Image-to-HTML Ratio: If I download these, I’m not a robot
  • 45. Image-to-HTML Ratio • The ratio between the number of image files and the number of HTML files per session • Robot sessions have an image-to-HTML ratio below 1:10, as suggested by Stassopoulou et al. 2005
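A minimal sketch of the image-to-HTML heuristic follows. It takes per-session counts as input, since the slides do not say how image and HTML requests are recognized in the log. The function names and the treatment of sessions with no HTML requests are assumptions.

```python
def image_to_html_ratio(num_images, num_html):
    """Ratio of image requests to HTML requests in one session."""
    if num_html == 0:
        # Assumption: with no HTML requests, any image makes the ratio unbounded.
        return float("inf") if num_images else 0.0
    return num_images / num_html


def looks_like_robot(num_images, num_html, threshold=0.1):
    """Flag a session whose image-to-HTML ratio falls below 1:10."""
    return image_to_html_ratio(num_images, num_html) < threshold


crawler_like = looks_like_robot(num_images=0, num_html=50)   # ratio 0.0
browser_like = looks_like_robot(num_images=12, num_html=3)   # ratio 4.0
```

The intuition is that a browser fetches the images embedded in each page, while many crawlers request only the HTML.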
  • 46. The image-to-HTML ratio is the best at detecting robots
  • 47. Traffic Analysis • Records remaining after cleaning: 21.3% (426,317 out of 2M) • Unique IPs: 21,932 • Users: 33,841 • Sessions: 37,634
  • 48. Robots have longer sessions than humans
  • 49. Humans spend more time than robots
  • 50. Robots outnumber humans in terms of: • Sessions 10:1 • Raw HTTP Accesses 5:4 • MB Transferred 4:1
  • 51. User Access Patterns in Web Archives • Dip • Dive • Slide • Skim
  • 52. Dip: simple access to TimeMap or memento
  • 53. Dive: different pages at approximately the same archive time (example archive times: November 12, 2009 11:55:54; November 12, 2009 05:37:22; November 12, 2009 05:38:02)
  • 54. Slide: the same page at different archive times (example archive times: March 18, 2013 13:56:00; November 15, 2009 05:33:01; July 31, 2006 23:55:45)
  • 55. Skim: lists of TimeMaps (examples: http://web.archive.org/web/*/http://cnn.com/; http://web.archive.org/web/*/http://www.bbc.com/; http://web.archive.org/web/*/http://www.nytimes.com/)
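The four pattern definitions (Dip, Dive, Slide, Skim) can be illustrated with a simplified whole-session classifier. This is only a sketch and not the detection procedure used in the study. Several pieces are assumptions introduced here: the `classify_session` and `parse_wayback` names, the one-day tolerance for "approximately the same archive time", the `other` fallback label, and the example page URLs for the dive. The archive timestamps come from the Dive and Slide slides.

```python
import re
from datetime import datetime, timedelta

# Wayback URI: /web/<14-digit timestamp or *>[modifier]/<original URI>
WAYBACK_RE = re.compile(r"/web/(\*|\d{14})[a-z_]*/(.+)$")


def parse_wayback(url):
    """Return (archive_time or None for a TimeMap, original URI)."""
    match = WAYBACK_RE.search(url)
    if match is None:
        raise ValueError(f"not a Wayback URI: {url}")
    stamp, original = match.groups()
    when = None if stamp == "*" else datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return when, original.rstrip("/")


def classify_session(urls, same_time_tolerance=timedelta(days=1)):
    """Label a whole session as dip, skim, slide, dive, or other."""
    requests = [parse_wayback(u) for u in urls]
    if len(requests) == 1:
        return "dip"
    times = [t for t, _ in requests]
    pages = {p for _, p in requests}
    if all(t is None for t in times):
        return "skim" if len(pages) > 1 else "other"
    if any(t is None for t in times):
        return "other"
    if len(pages) == 1 and len(set(times)) > 1:
        return "slide"
    if len(pages) > 1 and max(times) - min(times) <= same_time_tolerance:
        return "dive"
    return "other"


base = "http://web.archive.org/web/"
skim = [base + "*/http://cnn.com/", base + "*/http://www.bbc.com/",
        base + "*/http://www.nytimes.com/"]
slide = [base + ts + "/http://www.cnn.com/"
         for ts in ("20130318135600", "20091115053301", "20060731235545")]
dive = [base + "20091112115554/http://www.cnn.com/",
        base + "20091112053722/http://www.cnn.com/world/",
        base + "20091112053802/http://www.cnn.com/tech/"]
labels = [classify_session(s) for s in (skim, slide, dive, skim[:1])]
```

A real session usually mixes patterns, so a fuller implementation would look for these patterns in runs of consecutive requests rather than labeling the session as a whole.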
  • 56. Everybody Dips, Humans Dive, Robots Skim (Robots: 34,203 sessions; Humans: 3,431 sessions)
  • 57. Pattern Length (examples shown: Slide length = 4, Skim length = 3)
  • 58. Small Medians, Large Standard Deviations
  • 59. Only recent past exhibits locality of reference
  • 60. Only recent past exhibits locality of reference • Cache replacement policies should favor recent past
  • 61. Conclusions • We introduced traffic analysis for the Wayback Machine • We discovered that robots outnumber humans – 10:1 in terms of sessions – 5:4 in terms of raw, unfiltered requests – 4:1 in terms of megabytes transferred – Robots need APIs • We identified four major web archive access patterns – Dip – Slide – Dive – Skim • Only recent past exhibits locality of reference • http://arxiv.org/abs/1305.5959
  • 62. Extra Slides
  • 63. The Features of the Samples
    Days | Feb 2 | Feb 3 | Feb 4 | Feb 5 | Feb 6 | Feb 7 | Feb 8 | Mean | SD | SE
    Duration | 0:33:12 | 0:31:15 | 0:40:34 | 0:42:57 | 0:29:35 | 0:25:45 | 0:24:33 | 0:32:33 | 0:06:29 | 0:02:27
    GET | 98.4% | 99.3% | 97.7% | 97.9% | 99.4% | 99.7% | 99.8% | 99% | 0.8% | 0.3%
    Embedded | 47.4% | 34.8% | 43.7% | 42.7% | 41.9% | 44.7% | 46.8% | 43.1% | 3.9% | 1.5%
    SI Robots | 6.2% | 12.0% | 7.7% | 7.7% | 2.9% | 3.5% | 3.8% | 6.3% | 3.0% | 1.1%
    NullRef | 42.6% | 56.6% | 47.5% | 47.0% | 49.4% | 42.6% | 43.9% | 47.1% | 4.6% | 1.7%
    s2xx | 33.7% | 32.4% | 34.2% | 33.2% | 34.1% | 33.4% | 33.6% | 33.5% | 0.6% | 0.2%
    s3xx | 51.8% | 52.3% | 50.8% | 52.2% | 51.7% | 51.9% | 53.2% | 52.0% | 0.7% | 0.3%
    s4xx | 11.7% | 13.1% | 12.0% | 11.6% | 11.2% | 10.3% | 10.1% | 11.4% | 0.9% | 0.4%
    s5xx | 2.8% | 2.3% | 3.0% | 2.9% | 3.0% | 4.4% | 3.1% | 3.1% | 0.6% | 0.2%
    Cleaned | 21.3% | 23.0% | 17.6% | 17.7% | 20.7% | 18.1% | 16.9% | 19.3% | 2.2% | 0.8%
    Sessions | 37,634 | 31,731 | 32,159 | 28,750 | 36,087 | 35,848 | 32,117 | 33,475 | 2,896 | 1,094
  • 64. Very Small Standard Errors among Samples (same table as slide 63)
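The Mean, SD, and SE columns can be checked against the Sessions row of the samples table. The slides do not state how SD and SE were computed; the sketch below assumes a population standard deviation over the seven daily samples and SE = SD / sqrt(n), which appears to match the reported values.

```python
import math
import statistics

# Sessions per day, Feb 2 to Feb 8, from the samples table.
sessions = [37634, 31731, 32159, 28750, 36087, 35848, 32117]

mean_sessions = statistics.mean(sessions)
# Assumption: population SD (divide by n), not sample SD (divide by n - 1).
sd_sessions = statistics.pstdev(sessions)
# Standard error of the mean across the n = 7 daily samples.
se_sessions = sd_sessions / math.sqrt(len(sessions))

print(round(mean_sessions), round(sd_sessions), round(se_sessions))
```

Under these assumptions the rounded results agree with the table's 33,475, 2,896, and 1,094. Using the sample standard deviation instead would give a larger SD than the one reported.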
  • 65. Feb. 2, 2012 sample is representative (same table as slide 63)
  • 66. Results of Data Cleaning • The records remaining after cleaning are 21.3% of the requests in the raw file.
  • 67. Robots outnumber humans in terms of: • Sessions 10:1 • Raw HTTP Accesses 5:4 • MB Transferred 4:1
    Users | Sessions | Requests (Raw) | Transferred MB
    Robots | 34,203 (90.9%) | 1,002,573 (50.1%) | 20,010
    Humans | 3,431 (9.10%) | 810,049 (40.5%) | 4,459
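The headline ratios follow from the counts in the table above. A short check, using only those numbers (the dictionary layout and key names are illustrative):

```python
# Counts from the robots vs. humans table.
robots = {"sessions": 34203, "raw_requests": 1002573, "mb": 20010}
humans = {"sessions": 3431, "raw_requests": 810049, "mb": 4459}

# Robot-to-human ratio for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}

for key, value in ratios.items():
    print(f"{key}: {value:.2f} to 1")
```

The session ratio comes out just under 10, and the raw request ratio is close to 1.25, that is, 5:4. The megabyte ratio is nearer 4.5, which the slide reports as 4:1.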
  • 68. Humans exhibit Dip and Dive, while robots exhibit Dip and Skim (figure labels: Robots, Humans; 328 Slides, 571 Dives, 1167 Slides, 1942 Dives)
  • 69. The total number of mementos available for 2011 was similar to previous years.