Access Patterns for Robots
and Humans in Web Archives
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Computer Scie...
Access Patterns for Robots and Humans in Web Archives 2
0.204.48.255 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.a...
Access Patterns for Robots and Humans in Web Archives 3
0.204.48.255 - - [02/Feb/2012:06:57:19+0000] "GET http://wayback.a...
Access Patterns for Robots and Humans in Web Archives
Motivation
• There have been many studies for web access
patterns
• ...
Access Patterns for Robots and Humans in Web Archives
Research Question
• How do users, both humans and robots,
access web...
Access Patterns for Robots and Humans in Web Archives
Methodology
6
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
7
0.247.222.86 - - [02/Feb/201...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
8
0.247.222.86 - - [02/Feb/201...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
9
0.247.222.86 - - [02/Feb/201...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
10
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
11
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
12
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
13
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
14
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
15
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
16
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
17
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
18
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
19
0.247.222.86 - - [02/Feb/20...
Access Patterns for Robots and Humans in Web Archives
Dataset
• More than 82 million requests per day come
to the Wayback ...
Access Patterns for Robots and Humans in Web Archives
Pre-processing
• Data Cleaning
• Session Identification
• Robot Dete...
Access Patterns for Robots and Humans in Web Archives
Data Cleaning
22
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web...
Access Patterns for Robots and Humans in Web Archives
Embedded Resources
23
http://web.archive.org/web/20070519015308/
htt...
Access Patterns for Robots and Humans in Web Archives
Embedded Resources
24
http://web.archive.org/web/20070519015308/
htt...
Access Patterns for Robots and Humans in Web Archives
Static Resources
25
http://web.archive.org/web/20070519015308/
http:...
Access Patterns for Robots and Humans in Web Archives
Static Resources
26
http://web.archive.org/web/20070519015308/
http:...
Access Patterns for Robots and Humans in Web Archives
Invalid requests
27
http://web.archive.org/web/20100102003557/
about...
Access Patterns for Robots and Humans in Web Archives
Invalid requests
28
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://...
Access Patterns for Robots and Humans in Web Archives
Requests that had 3xx status code
29
http://web.archive.org/web/2013...
Access Patterns for Robots and Humans in Web Archives
Requests that had 3xx status code
30
0.11.160.135 [02/Feb/2012:00:01...
Access Patterns for Robots and Humans in Web Archives
Requests that had 3xx status code
31
0.11.160.135 [02/Feb/2012:00:01...
Access Patterns for Robots and Humans in Web Archives
Session: set of web pages requested
by a particular user
32
1 mins 4...
Access Patterns for Robots and Humans in Web Archives
Session: set of web pages requested
by a particular user
33
1 mins 4...
Access Patterns for Robots and Humans in Web Archives
Session Identification
• Grouping: based on the IP and User-
Agent
•...
Access Patterns for Robots and Humans in Web Archives
Robot Detection is a big challenge
35
I’m not a
robot
Access Patterns for Robots and Humans in Web Archives
Distinguishing Robots from
Humans
36
Access Patterns for Robots and Humans in Web Archives
User-Agent Check
0.182.141.149 - -
[02/Feb/2012:00:01:51 +0000] "GET...
Access Patterns for Robots and Humans in Web Archives
Number of User-Agent per IP
38
Access Patterns for Robots and Humans in Web Archives
Number of User-Agent per IP
39
One IP with User-Agent ≥20 = lying Ro...
Access Patterns for Robots and Humans in Web Archives
Robots.txt file
• A session that contains a request for robots.txt is
a...
Access Patterns for Robots and Humans in Web Archives
6 requests, 2 seconds → robot
41
0.182.141.149 - - [02/Feb/2012:07:0...
Access Patterns for Robots and Humans in Web Archives
3 requests, 520 seconds
(9 minutes) → human
42
0.11.160.13 - - [02/F...
Access Patterns for Robots and Humans in Web Archives
0.5 is a Good Browsing Speed Threshold
for Distinguishing Robots and...
Access Patterns for Robots and Humans in Web Archives
Image-to-HTML Ratio
44
If I download
these, I’m
not a robot
Access Patterns for Robots and Humans in Web Archives
Image-to-HTML Ratio
• The ratio between the number of image files
an...
Access Patterns for Robots and Humans in Web Archives
Image-to-HTML is the best in
detecting robots
46
Access Patterns for Robots and Humans in Web Archives
Traffic Analysis
• Records remaining after cleaning: 21.3%
(426,317 ...
Access Patterns for Robots and Humans in Web Archives
Robots have longer sessions
than humans
48
Access Patterns for Robots and Humans in Web Archives
Humans spend more time
than Robots
49
Access Patterns for Robots and Humans in Web Archives
Robots outnumber humans
in terms of:
50
Sessions
10
1
Raw HTTP
Acces...
Access Patterns for Robots and Humans in Web Archives
User Access Patterns in
Web Archives
• Dip
• Dive
• Slide
• Skim
51
Access Patterns for Robots and Humans in Web Archives
Dip: simple access to
TimeMap or memento
52
TimeMap Memento
Access Patterns for Robots and Humans in Web Archives
Dive: different pages at approximately
the same archive time
53
Nove...
Access Patterns for Robots and Humans in Web Archives
Slide: the same page at different
archive times
54
March 18, 2013 13...
Access Patterns for Robots and Humans in Web Archives
Skim: lists of TimeMaps
55
http://web.archive.org/web/*/
http://cnn....
Access Patterns for Robots and Humans in Web Archives
Everybody Dips, Humans Dive,
Robots Skim
56
Robots (34,203 sessions)...
Access Patterns for Robots and Humans in Web Archives
Pattern Length
57
Slide length = 4
Skim length = 3
Access Patterns for Robots and Humans in Web Archives
Small Medians, Large
Standard Deviations
58
Access Patterns for Robots and Humans in Web Archives
Only recent past exhibits
locality of reference
59
Access Patterns for Robots and Humans in Web Archives
Only recent past exhibits
locality of reference
60
Cache replacement...
Access Patterns for Robots and Humans in Web Archives
Conclusions
• We introduced traffic analysis for the Wayback Machine...
Access Patterns for Robots and Humans in Web Archives
Extra Slides
62
Access Patterns for Robots and Humans in Web Archives
The Features of the Samples
Days Feb 2 Feb 3 Feb 4 Feb 5 Feb 6 Feb 7...
Access Patterns for Robots and Humans in Web Archives
Very Small Standard Errors among
Samples
64
Days Feb 2 Feb 3 Feb 4 F...
Access Patterns for Robots and Humans in Web Archives
Feb. 2, 2012 sample is representative
Days Feb 2 Feb 3 Feb 4 Feb 5 F...
Access Patterns for Robots and Humans in Web Archives
Results of Data Cleaning
• The records remained after cleaning are 2...
Access Patterns for Robots and Humans in Web Archives
Robots outnumber humans
in terms of:
67
Sessions
10
1
Raw HTTP
Acces...
Access Patterns for Robots and Humans in Web Archives
Humans exhibit Dip and Dive,
while robots exhibit Dip and Skim
68
Ro...
Access Patterns for Robots and Humans in Web Archives
The total number of mementos available
for 2011 was similar to previ...
Transcript of "Access Patterns for Robots and Humans in Web Archives"

  1. Access Patterns for Robots and Humans in Web Archives
     Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
     Computer Science Department, Old Dominion University, Norfolk, VA
     yasmin@cs.odu.edu
  2. (Raw Wayback Machine access log records, shown as a full-slide excerpt)
     0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
     0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
     0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
     0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
     0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
     0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
     0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
     0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
     0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
     0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
     0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
     0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
     0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
     0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
     0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
     …
  4. Motivation
     • There have been many studies of web access patterns
     • This is the first study to use the Internet Archive's web server logs to discover how users access web archives
  5. Research Question
     • How do users, both humans and robots, access web archives?
  6. Methodology
  7. 7. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 7 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" 0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.aura.vu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"}
  8. 8. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 8 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86
  9. 9. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 9 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 IPs had been anonymized by Internet Archive
  10. 10. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 10 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000
  11. 11. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 11 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET
  12. 12. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 12 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com
  13. 13. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 13 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com TimeMap
  14. 14. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 14 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/20130318135600/http://www.cnn.com 0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET http://web.archive.org/web/20130318135600/http://www.cnn.com/ HTTP/1.1" 200 18875 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"} Memento TimeMap
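The TimeMap and Memento labels on this slide correspond to two URI shapes visible in the sample records. A minimal sketch of that distinction follows; the regular expressions and the function name `classify_wayback_uri` are illustrative assumptions inferred from the sample URIs, not the authors' code.

```python
import re

# Assumed URI shapes, inferred from the sample log records on the slides:
#   /web/*/<original URI>                          -> TimeMap (list of captures)
#   /web/<14-digit timestamp>[flag_]/<original URI> -> memento (one capture)
TIMEMAP_RE = re.compile(r"^https?://[^/]+/web/\*/(?P<original>.+)$")
MEMENTO_RE = re.compile(
    r"^https?://[^/]+/web/(?P<timestamp>\d{14})(?P<flag>[a-z]{2}_)?/(?P<original>.+)$"
)


def classify_wayback_uri(uri):
    """Return 'timemap', 'memento', or 'other' for a requested URI."""
    if TIMEMAP_RE.match(uri):
        return "timemap"
    if MEMENTO_RE.match(uri):
        return "memento"
    return "other"


if __name__ == "__main__":
    for uri in (
        "http://wayback.archive.org/web/*/http://www.cnn.com",
        "http://web.archive.org/web/20130318135600/http://www.cnn.com/",
    ):
        print(classify_wayback_uri(uri), uri)
```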
  15. 15. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 15 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1
  16. 16. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 16 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200
  17. 17. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 17 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433
  18. 18. Access Patterns for Robots and Humans in Web Archives Sample of Wayback Machine access logs 18 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433 • Referring URI: http://www.archive.org/web/web.php
  19. Sample of Wayback Machine access logs
      0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
      • Client IP: 0.247.222.86
      • Access time: 02/Feb/2012:07:03:46 +0000
      • HTTP request method: GET
      • URI: http://wayback.archive.org/web/*/http://www.cnn.com
      • Protocol: HTTP/1.1
      • HTTP status code: 200
      • Bytes sent: 96433
      • Referring URI: http://www.archive.org/web/web.php
      • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
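The field breakdown on this slide can be reproduced with a single regular expression. The sketch below assumes the records follow the layout of the sample shown (IP, two placeholder fields, bracketed timestamp, quoted request, status, bytes, quoted referrer, quoted User-Agent); `LOG_RE` and `parse_record` are hypothetical names, and records in other layouts simply return `None`.

```python
import re
from datetime import datetime

# Pattern for the record layout shown on the slide (an assumption:
# other records in the real logs may deviate from it).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_record(line):
    """Split one access-log line into the fields named on the slide."""
    match = LOG_RE.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


if __name__ == "__main__":
    sample = (
        '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
        '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" '
        '200 96433 "http://www.archive.org/web/web.php" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
        '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
    )
    for name, value in parse_record(sample).items():
        print(f"{name}: {value}")
```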
  20. Dataset
      • More than 82 million requests per day come to the Wayback Machine
      • Cluster sampling: one week, Feb. 2-8, 2012
      • Random sampling: a random slice (2 million requests) from each day of the week
      • We examined all seven days and found that Feb. 2 is a representative sample
        – For details, see Section 4.2 and Table 3 in the paper
  21. Pre-processing
      • Data Cleaning
      • Session Identification
      • Robot Detection
  22. Data Cleaning
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      Requested page: http://web.archive.org/web/20070519015308/http://www.jcdl.org/
  23. Embedded Resources
      Requested page: http://web.archive.org/web/20070519015308/http://www.jcdl.org/
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  25. Static Resources
      Requested page: http://web.archive.org/web/20070519015308/http://www.jcdl.org/
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  27. Invalid requests
      Requested page: http://web.archive.org/web/20100102003557/about:blank
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  28. Invalid requests
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/http://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
      0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wayback-toolbar-logo.png HTTP/1.1" 200 3700 "-" "Mozilla/5.0"
      0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0"
      0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
      Requested page: http://web.archive.org/web/20100102003557/about:blank
  29. 29. Access Patterns for Robots and Humans in Web Archives Requests that had 3xx status code 29 http://web.archive.org/web/20130114160045/ http://www.jcdl.org/0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  30. 30. Access Patterns for Robots and Humans in Web Archives Requests that had 3xx status code 30 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/ curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/" HTTP/1.1 302 Moved Temporarily Server: Tengine/1.4.3 Date: Tue, 02 Jul 2013 19:48:59 GMT Content-Type: application/octet-stream Content-Length: 0 Connection: keep-alive set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT; Location: /web/20130114160045/http://www.jcdl.org/
  31. 31. Access Patterns for Robots and Humans in Web Archives Requests that had 3xx status code 31 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/ HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/
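The filtering illustrated on slides 26-31 could be sketched roughly as follows. This is a minimal sketch, not the authors' pipeline: the regular expressions, the function name `keep_request`, and the decision to also drop `im_` embedded images are assumptions made for illustration.

```python
import re

# Hypothetical pattern for the request and status fields of the sample log lines.
REQUEST_RE = re.compile(r'"GET (?P<url>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

# Wayback path: /web/<timestamp or *><optional modifier such as im_>/<original URI>
WAYBACK_RE = re.compile(r'/web/(?P<ts>\d{14}\*?|\*)(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$')


def keep_request(line):
    """Return True if a log line survives the cleaning rules sketched here."""
    m = REQUEST_RE.search(line)
    if m is None:
        return False                      # line does not look like a GET record
    url, status = m.group("url"), int(m.group("status"))
    if 300 <= status < 400:
        return False                      # redirect (3xx status code)
    if "staticweb.archive.org" in url:
        return False                      # static resource such as the toolbar logo
    w = WAYBACK_RE.search(url)
    if w is None:
        return False                      # not a Wayback Machine page request
    if w.group("mod"):
        return False                      # assumption: embedded resources are dropped too
    if not w.group("orig").startswith(("http://", "https://")):
        return False                      # invalid target such as about:blank
    return True
```

Applied to the five sample lines on slide 26, only the first request (the archived jcdl.org page returned with status 200) would pass these rules.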
  32. 32. Access Patterns for Robots and Humans in Web Archives Session: set of web pages requested by a particular user. Timeline: p1 (1 min) p2 (4 mins) p3 (3 mins) p4 (9 mins) p5
  33. 33. Access Patterns for Robots and Humans in Web Archives Session: set of web pages requested by a particular user. Timeline: p1 (1 min) p2 (4 mins) p3 (3 mins) p4 (9 mins) p5. Time between two requests ≤ 10 minutes
  34. 34. Access Patterns for Robots and Humans in Web Archives Session Identification • Grouping: based on the IP and User-Agent • Threshold timeout: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
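The session identification rule (group by IP and User-Agent, split on gaps longer than 10 minutes) might be sketched like this. The tuple layout and the name `split_sessions` are hypothetical; only the grouping key and the timeout come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold timeout from the slide


def split_sessions(requests, timeout=TIMEOUT):
    """Group (ip, user_agent, timestamp, url) tuples into sessions.

    A user is the (ip, user_agent) pair. A new session starts whenever the gap
    to that user's previous request is longer than the timeout.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in requests:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort(key=lambda h: h[0])          # order each user's requests by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:     # gap too long: close the session
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions
```

With the gaps shown on slide 32 (1, 4, 3, and 9 minutes), all five pages p1 to p5 fall into a single session, because no gap exceeds the 10-minute threshold.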
  35. 35. Access Patterns for Robots and Humans in Web Archives Robot Detection is a big challenge 35 I’m not a robot
  36. 36. Access Patterns for Robots and Humans in Web Archives Distinguishing Robots from Humans 36
  37. 37. Access Patterns for Robots and Humans in Web Archives User-Agent Check
      0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/ HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
  38. 38. Access Patterns for Robots and Humans in Web Archives Number of User-Agent per IP 38
  39. 39. Access Patterns for Robots and Humans in Web Archives Number of User-Agent per IP: one IP with ≥ 20 User-Agents = lying robot
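One way to apply the User-Agents-per-IP heuristic is to count distinct User-Agent strings for each IP. A minimal sketch, assuming the requests are available as `(ip, user_agent)` pairs; the function name is hypothetical and only the threshold of 20 comes from the slide.

```python
from collections import defaultdict

MAX_AGENTS_PER_IP = 20  # threshold from the slide


def ips_with_many_agents(requests, threshold=MAX_AGENTS_PER_IP):
    """Return the IPs that presented at least `threshold` distinct User-Agents."""
    agents = defaultdict(set)
    for ip, agent in requests:
        agents[ip].add(agent)               # sets ignore repeated User-Agents
    return {ip for ip, seen in agents.items() if len(seen) >= threshold}
```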
  40. 40. Access Patterns for Robots and Humans in Web Archives Robots.txt file • A session that contains a request for robots.txt is a robot session
      0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET http://wayback.archive.org/web/*/http://www.devilscafe.in HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
      0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET http://wayback.archive.org/web/*/http://www.genie.co.il HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
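The robots.txt rule can be checked per session by looking at the path of each requested URL. A minimal sketch; the function name is hypothetical, and treating only the archive's own top-level `/robots.txt` as a match (not archived copies of other sites' robots.txt files) is an assumption.

```python
from urllib.parse import urlsplit


def requests_robots_txt(urls):
    """Return True if any URL in the session asks for the server's /robots.txt."""
    return any(urlsplit(u).path == "/robots.txt" for u in urls)
```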
  41. 41. Access Patterns for Robots and Humans in Web Archives 6 requests, 2 seconds → robot
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  42. 42. Access Patterns for Robots and Humans in Web Archives 3 requests, 520 seconds (9 minutes) → human
      0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200 566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
      0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  43. 43. Access Patterns for Robots and Humans in Web Archives 0.5 is a Good Browsing Speed Threshold for Distinguishing Robots and Humans (Nithya et al. 2012, Reddy et al. 2012). Browsing Speed (BS):
      \[ BS = \frac{\text{session length}}{\text{session duration}}, \qquad \text{user} = \begin{cases} \text{human} & \text{if } BS \le 0.5 \\ \text{robot} & \text{if } BS > 0.5 \end{cases} \]
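The browsing speed test can be written as a small function. This sketch assumes session length is the number of requests and session duration is measured in seconds, which matches the examples on slides 41 and 42; how to treat zero-duration sessions is not stated in the slides, so the code leaves them undecided. The function names are hypothetical.

```python
def browsing_speed(num_requests, duration_seconds):
    """Requests per second for a session, or None if the duration is zero."""
    if duration_seconds <= 0:
        return None                       # assumption: undecided, not stated in the slides
    return num_requests / duration_seconds


def is_robot_by_speed(num_requests, duration_seconds, threshold=0.5):
    """True if BS > threshold, False if BS <= threshold, None if undecided."""
    bs = browsing_speed(num_requests, duration_seconds)
    if bs is None:
        return None
    return bs > threshold
```

For the two examples in the slides, `is_robot_by_speed(6, 2)` classifies the six-request burst as a robot and `is_robot_by_speed(3, 520)` classifies the three-request session as human.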
  44. 44. Access Patterns for Robots and Humans in Web Archives Image-to-HTML Ratio 44 If I download these, I’m not a robot
  45. 45. Access Patterns for Robots and Humans in Web Archives Image-to-HTML Ratio • The ratio between the number of image files and the number of HTML files requested per session • Robot sessions have an image-to-HTML ratio below 1:10, as suggested by Stassopoulou et al. 2005
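The image-to-HTML heuristic might be sketched as below. The extension list, the function names, and the simplification of counting every non-image request as HTML are assumptions for illustration; only the 1:10 threshold comes from the slide.

```python
import os
from urllib.parse import urlsplit

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico"}  # assumed list


def is_image(url):
    """Guess from the file extension whether a URL points at an image."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    return ext in IMAGE_EXTS


def is_robot_by_image_ratio(urls, threshold=0.1):
    """True if the session's image-to-HTML ratio is below the threshold (1:10)."""
    images = sum(1 for u in urls if is_image(u))
    html = len(urls) - images             # simplification: non-images count as HTML
    if html == 0:
        return False                      # no HTML requests: ratio undefined, not flagged
    return images / html < threshold
```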
  46. 46. Access Patterns for Robots and Humans in Web Archives Image-to-HTML is the best in detecting robots 46
  47. 47. Access Patterns for Robots and Humans in Web Archives Traffic Analysis • Records remaining after cleaning: 21.3% (426,317 out of 2M) • Unique IPs: 21,932 • Users: 33,841 • Sessions: 37,634 47
  48. 48. Access Patterns for Robots and Humans in Web Archives Robots have longer sessions than humans 48
  49. 49. Access Patterns for Robots and Humans in Web Archives Humans spend more time than Robots 49
  50. 50. Access Patterns for Robots and Humans in Web Archives Robots outnumber humans in terms of: Sessions (10:1), Raw HTTP Accesses (5:4), MB Transferred (4:1)
  51. 51. Access Patterns for Robots and Humans in Web Archives User Access Patterns in Web Archives • Dip • Dive • Slide • Skim 51
  52. 52. Access Patterns for Robots and Humans in Web Archives Dip: simple access to TimeMap or memento 52 TimeMap Memento
  53. 53. Access Patterns for Robots and Humans in Web Archives Dive: different pages at approximately the same archive time 53 November 12, 2009 11:55:54 November 12, 2009 05:37:22 November 12, 2009 05:38:02
  54. 54. Access Patterns for Robots and Humans in Web Archives Slide: the same page at different archive times 54 March 18, 2013 13:56:00 November 15, 2009 05:33:01 July 31, 2006 23:55:45
  55. 55. Access Patterns for Robots and Humans in Web Archives Skim: lists of TimeMaps
      http://web.archive.org/web/*/http://cnn.com/
      http://web.archive.org/web/*/http://www.bbc.com/
      http://web.archive.org/web/*/http://www.nytimes.com/
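The Dive, Slide, and Skim definitions above can be illustrated by classifying a pair of consecutive Wayback Machine requests. This is only a sketch under stated assumptions: the regular expression, the function names, and the pairwise rule are hypothetical, and the "approximately the same archive time" condition for a Dive is not checked because the slides give no tolerance. A session with a single request would be a Dip.

```python
import re

# /web/<14-digit timestamp>/<URI> is a memento; /web/*/<URI> is a TimeMap.
WAYBACK_RE = re.compile(r"/web/(?P<ts>\d{14}|\*)/(?P<uri>.+)$")


def parse(url):
    """Return (timestamp or None for a TimeMap, original URI), or None if unparseable."""
    m = WAYBACK_RE.search(url)
    if m is None:
        return None
    ts = m.group("ts")
    return (None if ts == "*" else ts, m.group("uri").rstrip("/"))


def classify_step(prev_url, cur_url):
    """Label the transition between two requests as 'skim', 'slide', 'dive', or None."""
    prev, cur = parse(prev_url), parse(cur_url)
    if prev is None or cur is None:
        return None
    (prev_ts, prev_uri), (cur_ts, cur_uri) = prev, cur
    if prev_ts is None and cur_ts is None:
        return "skim" if prev_uri != cur_uri else None   # TimeMap after TimeMap
    if prev_ts is None or cur_ts is None:
        return None                                      # TimeMap/memento mix: no label here
    if prev_uri == cur_uri:
        return "slide" if prev_ts != cur_ts else None    # same page, different archive time
    return "dive"                                        # different pages (time not checked)
```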
  56. 56. Access Patterns for Robots and Humans in Web Archives Everybody Dips, Humans Dive, Robots Skim 56 Robots (34,203 sessions) Humans (3,431 sessions)
  57. 57. Access Patterns for Robots and Humans in Web Archives Pattern Length 57 Slide length = 4 Skim length = 3
  58. 58. Access Patterns for Robots and Humans in Web Archives Small Medians, Large Standard Deviations 58
  59. 59. Access Patterns for Robots and Humans in Web Archives Only recent past exhibits locality of reference 59
  60. 60. Access Patterns for Robots and Humans in Web Archives Only recent past exhibits locality of reference 60 Cache replacement policies should favor recent past
  61. 61. Access Patterns for Robots and Humans in Web Archives Conclusions
      • We introduced traffic analysis for the Wayback Machine
      • We discovered that robots outnumber humans
        – 10:1 in terms of sessions
        – 5:4 in terms of raw, unfiltered requests
        – 4:1 in terms of megabytes transferred
        – Robots need APIs http://arxiv.org/abs/1305.5959
      • We identified four major web archive access patterns
        – Dip
        – Slide
        – Dive
        – Skim
      • Only recent past exhibits locality of reference
  62. 62. Access Patterns for Robots and Humans in Web Archives Extra Slides 62
  63. 63. Access Patterns for Robots and Humans in Web Archives The Features of the Samples

      | Days      | Feb 2   | Feb 3   | Feb 4   | Feb 5   | Feb 6   | Feb 7   | Feb 8   | Mean    | SD      | SE      |
      |-----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
      | Duration  | 0:33:12 | 0:31:15 | 0:40:34 | 0:42:57 | 0:29:35 | 0:25:45 | 0:24:33 | 0:32:33 | 0:06:29 | 0:02:27 |
      | GET       | 98.4%   | 99.3%   | 97.7%   | 97.9%   | 99.4%   | 99.7%   | 99.8%   | 99%     | 0.8%    | 0.3%    |
      | Embedded  | 47.4%   | 34.8%   | 43.7%   | 42.7%   | 41.9%   | 44.7%   | 46.8%   | 43.1%   | 3.9%    | 1.5%    |
      | SI Robots | 6.2%    | 12.0%   | 7.7%    | 7.7%    | 2.9%    | 3.5%    | 3.8%    | 6.3%    | 3.0%    | 1.1%    |
      | NullRef   | 42.6%   | 56.6%   | 47.5%   | 47.0%   | 49.4%   | 42.6%   | 43.9%   | 47.1%   | 4.6%    | 1.7%    |
      | s2xx      | 33.7%   | 32.4%   | 34.2%   | 33.2%   | 34.1%   | 33.4%   | 33.6%   | 33.5%   | 0.6%    | 0.2%    |
      | s3xx      | 51.8%   | 52.3%   | 50.8%   | 52.2%   | 51.7%   | 51.9%   | 53.2%   | 52.0%   | 0.7%    | 0.3%    |
      | s4xx      | 11.7%   | 13.1%   | 12.0%   | 11.6%   | 11.2%   | 10.3%   | 10.1%   | 11.4%   | 0.9%    | 0.4%    |
      | s5xx      | 2.8%    | 2.3%    | 3.0%    | 2.9%    | 3.0%    | 4.4%    | 3.1%    | 3.1%    | 0.6%    | 0.2%    |
      | Cleaned   | 21.3%   | 23.0%   | 17.6%   | 17.7%   | 20.7%   | 18.1%   | 16.9%   | 19.3%   | 2.2%    | 0.8%    |
      | Sessions  | 37,634  | 31,731  | 32,159  | 28,750  | 36,087  | 35,848  | 32,117  | 33,475  | 2,896   | 1,094   |

  64. 64. Access Patterns for Robots and Humans in Web Archives Very Small Standard Errors among Samples (same table as slide 63)
  65. 65. Access Patterns for Robots and Humans in Web Archives Feb. 2, 2012 sample is representative (same table as slide 63)
  66. 66. Access Patterns for Robots and Humans in Web Archives Results of Data Cleaning • The records remaining after cleaning are 21.3% of the requests in the raw file.
  67. 67. Access Patterns for Robots and Humans in Web Archives Robots outnumber humans in terms of: Sessions (10:1), Raw HTTP Accesses (5:4), MB Transferred (4:1)

      | Users  | # Sessions     | # Requests (Raw)  | # Transferred MB |
      |--------|----------------|-------------------|------------------|
      | Robots | 34,203 (90.9%) | 1,002,573 (50.1%) | 20,010           |
      | Humans | 3,431 (9.10%)  | 810,049 (40.5%)   | 4,459            |
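The rounded ratios on this slide can be checked against the counts in the table. A small sketch; the dictionary names are hypothetical and the figures are copied from the slide.

```python
# Counts copied from slide 67.
robots = {"sessions": 34203, "raw_requests": 1002573, "mb_transferred": 20010}
humans = {"sessions": 3431, "raw_requests": 810049, "mb_transferred": 4459}

# Robot-to-human quotient for each measure.
ratios = {name: robots[name] / humans[name] for name in robots}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} robot units per human unit")
```

The quotients land close to the rounded 10:1 and 5:4 figures for sessions and raw requests; the megabytes quotient is somewhat above 4, so the slide's 4:1 appears to be rounded down.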
  68. 68. Access Patterns for Robots and Humans in Web Archives Humans exhibit Dip and Dive, while robots exhibit Dip and Skim. Robots: 328 Slides, 571 Dives. Humans: 1167 Slides, 1942 Dives.
  69. 69. Access Patterns for Robots and Humans in Web Archives The total number of mementos available for 2011 was similar to previous years. 69