Access Patterns for Robots and Humans in Web Archives
 

    Access Patterns for Robots and Humans in Web Archives Presentation Transcript

    • Access Patterns for Robots and Humans in Web Archives. Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson. Computer Science Department, Old Dominion University, Norfolk, VA. yasmin@cs.odu.edu
    • 0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
      0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
      0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
      0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
      0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
      0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
      0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
      0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
      0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
      0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
      0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
      0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      …
    • Motivation
      • There have been many studies of web access patterns
      • This is the first study to use the Internet Archive’s web server logs to discover how users access web archives
    • Research Question
      • How do users, both humans and robots, access web archives?
    • Methodology
    • Sample of Wayback Machine access logs
      0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com/ HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      Fields of the first entry:
      • Client IP: 0.247.222.86 (IPs had been anonymized by Internet Archive)
      • Access time: 02/Feb/2012:07:03:46 +0000
      • HTTP request method: GET
      • URI: http://wayback.archive.org/web/*/http://www.cnn.com (a TimeMap)
      • Protocol: HTTP/1.1
      • HTTP status code: 200
      • Bytes sent: 96433
      • Referring URI: http://www.archive.org/web/web.php
      • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
      The second entry requests http://web.archive.org/web/20130318135600/http://www.cnn.com/, a Memento; its referring URI is the TimeMap requested in the first entry.
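The field breakdown above can be reproduced with a small parser. This is a minimal sketch, assuming the entries follow the Combined Log Format exactly as in the sample; the names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the presentation.

```python
import re
from datetime import datetime

# One named group per field listed on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# The first sample entry from the slide.
SAMPLE = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" '
    '200 96433 "http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)
record = parse_log_line(SAMPLE)
print(record["ip"], record["status"], record["bytes"])
```

Lines that do not match, such as entries with a truncated User-Agent, come back as `None` so that a caller can count or inspect them separately.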
    • Dataset
      • More than 82 million requests per day reach the Wayback Machine
      • Cluster sampling: one week, Feb. 2-8, 2012
      • Random sampling: a random slice (2 million requests) from each day of the week
      • We examined all seven days and found that Feb. 2 is a representative sample
        – For details, see Section 4.2 and Table 3 in the paper
    • Pre-processing
      • Data Cleaning
      • Session Identification
      • Robot Detection
    • Data Cleaning
      Sample log entries:
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      Categories handled during data cleaning, each with its matching sample entry:
      • Embedded Resources: the im_ request for http://www.jcdl.org/images/jcdl2007-edie.jpg, which belongs to the page http://web.archive.org/web/20070519015308/http://www.jcdl.org/
      • Static Resources: the request for http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png
      • Invalid requests: http://web.archive.org/web/20100102003557/about:blank
      • Requests that had 3xx status code: http://web.archive.org/web/20140004100000/http://www.jcdl.org/, which redirects to http://web.archive.org/web/20130114160045/http://www.jcdl.org/
        curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
        HTTP/1.1 302 Moved Temporarily
        Server: Tengine/1.4.3
        Date: Tue, 02 Jul 2013 19:48:59 GMT
        Content-Type: application/octet-stream
        Content-Length: 0
        Connection: keep-alive
        set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
        Location: /web/20130114160045/http://www.jcdl.org/
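The four cleaning categories can be expressed as a small classifier. This is a rough sketch rather than the authors' procedure: the slides only show the `im_` modifier, the staticweb.archive.org host, and `about:blank`, so the extra entries in `EMBEDDED_MODIFIERS` and `INVALID_TARGET_PREFIXES` are assumptions, and `classify_request` is a hypothetical name.

```python
import re

# Wayback replay URI: /web/<14-digit timestamp or *><optional modifier>/<target>
WAYBACK_URI = re.compile(
    r'^https?://[^/]+/web/(?P<timestamp>\d{14}|\*)'
    r'(?P<modifier>[a-z]{2}_)?/(?P<target>.*)$'
)

STATIC_PREFIX = "http://staticweb.archive.org/"          # host shown on the slide
EMBEDDED_MODIFIERS = {"im_", "js_", "cs_"}                # only im_ is on the slide
INVALID_TARGET_PREFIXES = ("about:", "javascript:", "mailto:")  # only about: is on the slide


def classify_request(uri, status):
    """Label a request as 'static', 'embedded', 'invalid', 'redirect' or 'keep'."""
    if uri.startswith(STATIC_PREFIX):
        return "static"
    match = WAYBACK_URI.match(uri)
    if match:
        if match.group("modifier") in EMBEDDED_MODIFIERS:
            return "embedded"
        if match.group("target").lower().startswith(INVALID_TARGET_PREFIXES):
            return "invalid"
    if 300 <= status < 400:
        return "redirect"
    return "keep"


# The five sample entries from the Data Cleaning slide.
samples = [
    ("http://web.archive.org/web/20070519015308/http://www.jcdl.org/", 200),
    ("http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg", 200),
    ("http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png", 200),
    ("http://web.archive.org/web/20100102003557/about:blank", 302),
    ("http://web.archive.org/web/20140004100000/http://www.jcdl.org/", 302),
]
labels = [classify_request(uri, status) for uri, status in samples]
print(labels)
```

The invalid check runs before the status check because the `about:blank` entry on the slide carries a 302 status yet is shown under "Invalid requests".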
    • Session: set of web pages requested by a particular user
      p1 → (1 min) → p2 → (4 mins) → p3 → (3 mins) → p4 → (9 mins) → p5
      Time between two requests ≤ 10 minutes
    • Session Identification
      • Grouping: based on the IP and User-Agent
      • Threshold timeout: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
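The grouping and timeout rule can be sketched as follows. The function name `identify_sessions` and the tuple layout of the input are assumptions made for illustration; only the (IP, User-Agent) grouping and the 10-minute threshold come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)  # threshold from the slide


def identify_sessions(requests):
    """Split requests into sessions.

    requests: iterable of (ip, user_agent, timestamp) tuples.
    Returns a list of ((ip, user_agent), [timestamps]) pairs.
    """
    by_user = defaultdict(list)
    for ip, user_agent, timestamp in requests:
        by_user[(ip, user_agent)].append(timestamp)

    sessions = []
    for user, times in by_user.items():
        times.sort()
        current = [times[0]]
        for previous, timestamp in zip(times, times[1:]):
            if timestamp - previous <= SESSION_TIMEOUT:
                current.append(timestamp)
            else:
                # Gap longer than the timeout: close the session, start a new one.
                sessions.append((user, current))
                current = [timestamp]
        sessions.append((user, current))
    return sessions


# p1..p5 with gaps of 1, 4, 3 and 9 minutes, then a sixth request 11 minutes later.
start = datetime(2012, 2, 2, 7, 0, 0)
offsets = [0, 1, 5, 8, 17, 28]
requests = [("0.11.160.13", "Mozilla/5.0", start + timedelta(minutes=m)) for m in offsets]
sessions = identify_sessions(requests)
print([len(times) for _, times in sessions])
```

With the gaps from the slide, p1 through p5 stay in one session because every gap is at most 10 minutes; the added sixth request opens a second session.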
    • Robot Detection is a big challenge ("I’m not a robot")
    • Distinguishing Robots from Humans
    • User-Agent Check
      0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/ HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
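A User-Agent check can be as simple as a substring match. In this sketch the token list is an assumption chosen for illustration: only `Python-urllib` (and `MJ12bot` in a later slide) appear in the presentation, and the list the authors actually used is not given.

```python
# Illustrative tokens only; not the list used in the study.
ROBOT_UA_TOKENS = ("python-urllib", "bot", "crawler", "spider", "curl", "wget")


def is_robot_user_agent(user_agent):
    """Return True if the User-Agent string contains a known robot token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in ROBOT_UA_TOKENS)


print(is_robot_user_agent("Python-urllib/1.17"))
```

A check like this only catches robots that identify themselves, which is why the slides go on to combine it with behavioral heuristics.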
    • Number of User-Agents per IP
      • One IP with ≥ 20 User-Agents = lying robot
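Counting distinct User-Agents per IP takes only a few lines. This is a minimal sketch using the threshold of 20 from the slide; the function name and input layout are hypothetical.

```python
from collections import defaultdict

MAX_USER_AGENTS_PER_IP = 20  # threshold from the slide


def ips_with_many_user_agents(requests, threshold=MAX_USER_AGENTS_PER_IP):
    """Return the set of IPs seen with at least `threshold` distinct User-Agents.

    requests: iterable of (ip, user_agent) pairs.
    """
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}


# One IP rotating through 20 made-up agents, another using two.
requests = [("0.0.0.1", f"Agent/{n}") for n in range(20)]
requests += [("0.0.0.2", "Mozilla/5.0"), ("0.0.0.2", "Mozilla/4.0")]
flagged = ips_with_many_user_agents(requests)
print(flagged)
```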
    • Robots.txt file
      • A session that contains a request for robots.txt is a robot
      0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET http://wayback.archive.org/web/*/http://www.devilscafe.in HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET http://wayback.archive.org/web/*/http://www.genie.co.il HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
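The robots.txt rule can be sketched as a check over the URIs of one session. The sketch assumes that only a request for the archive's own top-level /robots.txt counts, as in the sample entry; a request for an archived copy of some other site's robots.txt would not match. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def session_requests_robots_txt(session_uris):
    """Return True if any request in the session targets /robots.txt on the host."""
    return any(urlsplit(uri).path == "/robots.txt" for uri in session_uris)


# The three requests from the slide's MJ12bot session.
mj12bot_session = [
    "http://web.archive.org/robots.txt",
    "http://wayback.archive.org/web/*/http://www.devilscafe.in",
    "http://wayback.archive.org/web/*/http://www.genie.co.il",
]
print(session_requests_robots_txt(mj12bot_session))
```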
    • 6 requests, 2 seconds → robot
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    • 3 requests, 520 seconds (9 minutes) → human
      0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200 566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
    • 0.5 is a Good Browsing Speed Threshold for Distinguishing Robots and Humans (Nithya et al. 2012, Reddy et al. 2012)
      Browsing Speed (BS):
      $$BS = \frac{\text{session length}}{\text{session duration}}, \qquad \text{session} = \begin{cases} \text{human} & \text{if } BS \le 0.5 \\ \text{robot} & \text{if } BS > 0.5 \end{cases}$$
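As a worked example of the browsing-speed rule, the sketch below applies the 0.5 threshold to the two example sessions from the preceding slides (6 requests in 2 seconds, 3 requests in 520 seconds). It assumes session length is counted in requests and session duration in seconds, which matches those examples but is not stated explicitly on this slide. The function names are hypothetical.

```python
def browsing_speed(num_requests, duration_seconds):
    """BS = session length / session duration (assumed: requests per second)."""
    if duration_seconds <= 0:
        # The slides do not say how zero-duration sessions are handled,
        # so this sketch refuses to guess.
        raise ValueError("session duration must be positive")
    return num_requests / duration_seconds


def classify_by_speed(num_requests, duration_seconds, threshold=0.5):
    """Label a session 'robot' if BS > threshold, otherwise 'human'."""
    speed = browsing_speed(num_requests, duration_seconds)
    return "robot" if speed > threshold else "human"


print(classify_by_speed(6, 2))    # BS = 3.0, above the threshold
print(classify_by_speed(3, 520))  # BS is roughly 0.006, below the threshold
```

A session sitting exactly at the threshold (for example 1 request in 2 seconds) is labelled human, because the rule uses BS ≤ 0.5 for humans.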
    • Image-to-HTML Ratio: "If I download these, I’m not a robot"
    • Image-to-HTML Ratio
      • The ratio between the number of image files and the number of HTML files requested per session
      • Robot sessions have an image-to-HTML ratio below 1:10, as suggested by Stassopoulou et al. 2005
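The ratio test can be sketched as follows. This is an illustration under stated assumptions: resources are classified by file extension only, the extension sets are hypothetical choices, and the input is the list of original (archived) URLs of one session with the Wayback prefix already removed. A session with no HTML requests has an undefined ratio and is left unflagged here.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension sets; the slides do not list the ones used in the study.
IMAGE_EXTS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}
HTML_EXTS = {"", ".html", ".htm", ".php", ".asp", ".aspx", ".jsp"}


def resource_kind(url):
    """Classify an original URL as 'image', 'html', or 'other' by extension."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in HTML_EXTS:
        return "html"
    return "other"


def image_to_html_ratio(urls):
    """Return images / HTML pages for one session, or None if there is no HTML."""
    kinds = [resource_kind(u) for u in urls]
    html = kinds.count("html")
    if html == 0:
        return None
    return kinds.count("image") / html


def looks_like_robot(urls, threshold=0.1):
    """Flag a session whose image-to-HTML ratio is below 1:10."""
    ratio = image_to_html_ratio(urls)
    return ratio is not None and ratio < threshold
```

For example, a session that fetches eleven pages and a single image has a ratio of 1:11 and would be flagged, while a session that fetches one page together with its embedded images would not.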
    • Image-to-HTML ratio is the best heuristic for detecting robots
    • Traffic Analysis
      • Records remaining after cleaning: 21.3% (426,317 out of 2M)
      • Unique IPs: 21,932
      • Users: 33,841
      • Sessions: 37,634
    • Robots have longer sessions than humans
    • Humans spend more time than robots
    • Robots outnumber humans in terms of:
      • Sessions: 10:1
      • Raw HTTP accesses: 5:4
      • MB transferred: 4:1
    • User Access Patterns in Web Archives
      • Dip
      • Dive
      • Slide
      • Skim
    • Dip: simple access to a TimeMap or memento (figure panels: TimeMap, Memento)
    • Dive: different pages at approximately the same archive time (example archive times: November 12, 2009 11:55:54; November 12, 2009 05:37:22; November 12, 2009 05:38:02)
    • Slide: the same page at different archive times (example archive times: March 18, 2013 13:56:00; November 15, 2009 05:33:01; July 31, 2006 23:55:45)
    • Skim: lists of TimeMaps
      http://web.archive.org/web/*/http://cnn.com/
      http://web.archive.org/web/*/http://www.bbc.com/
      http://web.archive.org/web/*/http://www.nytimes.com/
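The four pattern definitions can be turned into a rough labelling routine. The sketch below is an interpretation of the slide definitions, not the authors' algorithm: it labels each pair of consecutive Wayback requests in a session, treats `/web/*/` URLs as TimeMaps and `/web/<digits>/` URLs as mementos, and does not check how close two archive times are for a Dive, because the slides do not give a tolerance. `WAYBACK_RE`, `parse_wayback`, and `label_session` are hypothetical names.

```python
import re

# /web/<timestamp or *>[two-letter modifier such as im_ or fw_]/<original URL>
WAYBACK_RE = re.compile(r"/web/(?P<ts>\*|\d{1,14})(?:[a-z]{2}_)?/(?P<orig>.+)$")


def parse_wayback(url):
    """Return (kind, timestamp, original_url), or None for non-Wayback URLs."""
    m = WAYBACK_RE.search(url)
    if m is None:
        return None
    kind = "timemap" if m.group("ts") == "*" else "memento"
    return kind, m.group("ts"), m.group("orig").rstrip("/").lower()


def label_session(urls):
    """Label a session: 'dip' for a single request, else one label per transition."""
    reqs = [p for p in map(parse_wayback, urls) if p is not None]
    if not reqs:
        return []
    if len(reqs) == 1:
        return ["dip"]
    labels = []
    for (k1, t1, o1), (k2, t2, o2) in zip(reqs, reqs[1:]):
        if k1 == k2 == "timemap" and o1 != o2:
            labels.append("skim")    # TimeMap after TimeMap of another page
        elif k1 == k2 == "memento" and o1 == o2 and t1 != t2:
            labels.append("slide")   # same page, different archive time
        elif k1 == k2 == "memento" and o1 != o2:
            labels.append("dive")    # different page (time closeness not checked)
        else:
            labels.append("other")   # e.g. TimeMap followed by a memento
    return labels
```

Applied to the three TimeMap URLs listed under Skim, this returns two `skim` transitions; a real analysis would also need session boundaries and a rule for mixed sequences, which this sketch leaves as `other`.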
    • Everybody Dips, Humans Dive, Robots Skim (figure panels: Robots, 34,203 sessions; Humans, 3,431 sessions)
    • Pattern Length (figure examples: Slide length = 4; Skim length = 3)
    • Small Medians, Large Standard Deviations
    • Only recent past exhibits locality of reference
      • Cache replacement policies should favor the recent past
    • Conclusions
      • We introduced traffic analysis for the Wayback Machine
      • We discovered that robots outnumber humans
        – 10:1 in terms of sessions
        – 5:4 in terms of raw, unfiltered requests
        – 4:1 in terms of megabytes transferred
        – Robots need APIs
      • We identified four major web archive access patterns
        – Dip
        – Slide
        – Dive
        – Skim
      • Only recent past exhibits locality of reference
      http://arxiv.org/abs/1305.5959
    • Extra Slides
    • The Features of the Samples
      Days       Feb 2    Feb 3    Feb 4    Feb 5    Feb 6    Feb 7    Feb 8    Mean     SD       SE
      Duration   0:33:12  0:31:15  0:40:34  0:42:57  0:29:35  0:25:45  0:24:33  0:32:33  0:06:29  0:02:27
      GET        98.4%    99.3%    97.7%    97.9%    99.4%    99.7%    99.8%    99%      0.8%     0.3%
      Embedded   47.4%    34.8%    43.7%    42.7%    41.9%    44.7%    46.8%    43.1%    3.9%     1.5%
      SI Robots  6.2%     12.0%    7.7%     7.7%     2.9%     3.5%     3.8%     6.3%     3.0%     1.1%
      NullRef    42.6%    56.6%    47.5%    47.0%    49.4%    42.6%    43.9%    47.1%    4.6%     1.7%
      s2xx       33.7%    32.4%    34.2%    33.2%    34.1%    33.4%    33.6%    33.5%    0.6%     0.2%
      s3xx       51.8%    52.3%    50.8%    52.2%    51.7%    51.9%    53.2%    52.0%    0.7%     0.3%
      s4xx       11.7%    13.1%    12.0%    11.6%    11.2%    10.3%    10.1%    11.4%    0.9%     0.4%
      s5xx       2.8%     2.3%     3.0%     2.9%     3.0%     4.4%     3.1%     3.1%     0.6%     0.2%
      Cleaned    21.3%    23.0%    17.6%    17.7%    20.7%    18.1%    16.9%    19.3%    2.2%     0.8%
      Sessions   37,634   31,731   32,159   28,750   36,087   35,848   32,117   33,475   2,896    1,094
    • Very Small Standard Errors among Samples
    • Feb. 2, 2012 sample is representative
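The Mean, SD, and SE columns of the sample table can be checked against the Sessions row. The sketch below is an inference from the numbers rather than something the slides state: the reported values appear consistent with a population standard deviation (dividing by n) and SE = SD / √n with n = 7 daily samples.

```python
import math
import statistics

# Sessions per daily sample, Feb 2 through Feb 8, copied from the table.
sessions = [37634, 31731, 32159, 28750, 36087, 35848, 32117]

mean = statistics.mean(sessions)
sd = statistics.pstdev(sessions)      # population SD (assumption, see lead-in)
se = sd / math.sqrt(len(sessions))    # standard error of the mean

print(round(mean), round(sd), round(se))  # prints: 33475 2896 1094
```

Using the sample standard deviation (`statistics.stdev`, dividing by n - 1) would give a larger SD than the 2,896 shown in the table, which is why the population form is assumed here.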
    • Results of Data Cleaning
      • The records remaining after cleaning are 21.3% of the requests in the raw file.
    • Robots outnumber humans in terms of:
      • Sessions: 10:1
      • Raw HTTP accesses: 5:4
      • MB transferred: 4:1
      Users    # Sessions       # Requests (Raw)    # Transferred MB
      Robots   34,203 (90.9%)   1,002,573 (50.1%)   20,010
      Humans   3,431 (9.10%)    810,049 (40.5%)     4,459
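The headline ratios follow directly from the robot and human totals on this slide. The short calculation below shows the unrounded robot-to-human ratios; the dictionary keys are hypothetical labels for the three table columns.

```python
# Totals copied from the slide's table.
robots = {"sessions": 34203, "raw_requests": 1002573, "mb_transferred": 20010}
humans = {"sessions": 3431, "raw_requests": 810049, "mb_transferred": 4459}

# Robot-to-human ratio for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}

for key, value in ratios.items():
    print(f"{key}: {value:.2f} to 1")
```

The results are roughly 9.97, 1.24, and 4.49 to 1, which the slide summarizes as 10:1, 5:4, and 4:1.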
    • Humans exhibit Dip and Dive, while robots exhibit Dip and Skim (figure labels: Robots, Humans; 328 Slides, 571 Dives; 1167 Slides, 1942 Dives)
    • The total number of mementos available for 2011 was similar to previous years.