SlideShare a Scribd company logo
1 of 43
Access Patterns for Robots and Humans in Web ArchivesWho and What Links to the Internet Archive
Who and What Links to the
Internet Archive
Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, Michael L. Nelson
Computer Science Department
Old Dominion University, Norfolk, VA
mln@cs.odu.edu
Access Patterns for Robots and Humans in Web Archives 2Who and What Links to the Internet Archive
Motivation
• What do web archive users look for and where
do they come from?
EnglishRussianJapanese English
English
English
English
Spanish
Russian
Russian
Russian
Spanish
Spanish
German
German
German
Japanese
Japanese
Japanese
Access Patterns for Robots and Humans in Web Archives 3Who and What Links to the Internet Archive
Methodology
Access Patterns for Robots and Humans in Web Archives 4Who and What Links to the Internet Archive
Data Set
• Six million recordsfrom Internet Archive’s
Wayback Machine web server logs of
February 2, 2012
• Data set statistics
Get Embedded
Resources
Null
Referrers
2xx 3xx 4xx 5xx Humans Robots
99% 43% 47% 33% 51% 12% 4% 1.5% 18.8%
The percentage of humans
and robots remaining
after cleaning
Access Patterns for Robots and Humans in Web Archives 5Who and What Links to the Internet Archive
Sample from 6pm-midnight
(prime Internet hours)
Access Patterns for Robots and Humans in Web Archives 6Who and What Links to the Internet Archive
Wayback Machine Access Logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com
HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like
Gecko) Chrome/16.0.912.77 Safari/535.7
Access Patterns for Robots and Humans in Web Archives 7Who and What Links to the Internet Archive
Wayback Machine Access Logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com
HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like
Gecko) Chrome/16.0.912.77 Safari/535.7
IPs anonymized by Internet Archive
Access Patterns for Robots and Humans in Web Archives 8Who and What Links to the Internet Archive
Pre-Processing
• Data Cleaning
• Session Identification
• Robot Detection AlNoamany 2013
8
Access Patterns for Robots and Humans in Web Archives 9Who and What Links to the Internet Archive
Data Cleaning
9
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/
Access Patterns for Robots and Humans in Web Archives 10Who and What Links to the Internet Archive
Embedded Resources
10
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 11Who and What Links to the Internet Archive
Embedded Resources
11
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 12Who and What Links to the Internet Archive
Static Resources
12
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 13Who and What Links to the Internet Archive
Static Resources
13
http://web.archive.org/web/20070519015308/
http://www.jcdl.org/
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 14Who and What Links to the Internet Archive
Invalid Requests
14
http://web.archive.org/web/20100102003557/
about:blank
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 15Who and What Links to the Internet Archive
Invalid Requests
15
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
http://web.archive.org/web/20100102003557/
about:blank
Access Patterns for Robots and Humans in Web Archives 16Who and What Links to the Internet Archive
Requests that had 3xx Status Code
16
http://web.archive.org/web/20130114160045/
http://www.jcdl.org/0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
Access Patterns for Robots and Humans in Web Archives 17Who and What Links to the Internet Archive
Requests that had 3xx Status Code
17
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
http://web.archive.org/web/20130114160045/
http://www.jcdl.org/
curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/"
HTTP/1.1 302 Moved Temporarily
Server: Tengine/1.4.3
Date: Tue, 02 Jul 2013 19:48:59 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive
set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT;
Location: /web/20130114160045/http://www.jcdl.org/
Access Patterns for Robots and Humans in Web Archives 18Who and What Links to the Internet Archive
Requests that had 3xx Status Code
18
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308/http
://www.jcdl.org/HTTP/1.1" 200 2137 "-"
"Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20070519015308im_/h
ttp://www.jcdl.org/images/jcdl2007-edie.jpg
HTTP/1.1" 200 2137 "-" "Mozilla/5.0"
0.11.160.135 [02/Feb/2012:00:01:03] "GET
http://staticweb.archive.org/images/toolbar/wa
yback-toolbar-logo.png HTTP/1.1" 200 3700 "–"
"Mozilla/5.0"
0.151.147.108 [02/Feb/2012:00:01:03] "GET
http://web.archive.org/web/20100102003557/abou
t:blank HTTP/1.1" 302 0 "www.xx.com"
"Mozilla/4.0"
0.26.129.146 - - [02/Feb/2012:00:01:54] "GET
http://web.archive.org/web/20140004100000/http
://www.jcdl.org/ HTTP/1.1" 302 0 "-"
"Mozilla/5.0"
http://web.archive.org/web/20130114160045/
http://www.jcdl.org/
Access Patterns for Robots and Humans in Web Archives 19Who and What Links to the Internet Archive
Session: Set of Web Pages Requested
by a Particular User
19
1 mins 4 mins
3 mins 9 mins
p1 p2 p3
p4 p5
Access Patterns for Robots and Humans in Web Archives 20Who and What Links to the Internet Archive
Session: Set of Web Pages Requested
by a Particular User
20
1 mins 4 mins
3 mins 9 mins
p1 p2 p3
p4 p5
Time between two
requests ≤ 10 mins
Access Patterns for Robots and Humans in Web Archives 21Who and What Links to the Internet Archive
Session Identification
• Threshold timeout: 10 minutes Liu et al.
2007, Spiliopoulou et al. 2003
• Grouping: based on the IP and User-
Agent
21
Access Patterns for Robots and Humans in Web Archives 22Who and What Links to the Internet Archive
Robot Detection is a Big Challenge
22
I’m not a
robot
Access Patterns for Robots and Humans in Web Archives 23Who and What Links to the Internet Archive
User-Agent Check
0.182.141.149 - -
[02/Feb/2012:00:01:51 +0000] "GET
http://wayback.archive.org/web/199906
01000000*/http://www.belizefirst.com/
HTTP/1.0" 200 98507 "-"
"Python-urllib/1.17"
23
Access Patterns for Robots and Humans in Web Archives 24Who and What Links to the Internet Archive
Number of User-Agents per IP
24
Access Patterns for Robots and Humans in Web Archives 25Who and What Links to the Internet Archive
Number of User-Agents per IP
25
One IP with User-Agent ≥20 = lying Robot
Access Patterns for Robots and Humans in Web Archives 26Who and What Links to the Internet Archive
Robots.txt File
• Session that contains an access for robots.txt
is a robot
26
0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET
http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.1;
http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.devilscafe.in
HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.genie.co.il
HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
Access Patterns for Robots and Humans in Web Archives 27Who and What Links to the Internet Archive
6 Requests, 2 Seconds  Robot
27
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 “-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 “-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:3 +0000] "GET
http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 “-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
Access Patterns for Robots and Humans in Web Archives 28Who and What Links to the Internet Archive
3 Requests, 520 Seconds
(9 Minutes)  Human
28
0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200
566433 " http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "
http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8)
Access Patterns for Robots and Humans in Web Archives 29Who and What Links to the Internet Archive
Image-to-HTML Ratio
29
If I download
these, I’m
not a robot
Access Patterns for Robots and Humans in Web Archives 30Who and What Links to the Internet Archive
Image-to-HTML Ratio
• The ratio between the number of image files
and the number of HTML files per session
• Robots sessions are less than 1:10 image to
HTML ratio(Stassopoulou et al. 2005)
30
Access Patterns for Robots and Humans in Web Archives 31Who and What Links to the Internet Archive
finish
• Who link to the archive?
• How do people reach web archives?
• Why they link to the Archive?
• Deep links?
Found
in
Archive
YesNo
Check the
language of the
archived page
Found
on Live
Web
Check the
Status Code
YesNo
Check the
language
Check status
code on the
live web
Check
existence in
other archives
Access Patterns for Robots and Humans in Web Archives 32Who and What Links to the Internet Archive
Languages for Pages in the Archive
Access Patterns for Robots and Humans in Web Archives 33Who and What Links to the Internet Archive
Languages for Requested Pages
NOT in the Archive
Access Patterns for Robots and Humans in Web Archives 34Who and What Links to the Internet Archive
Most Languages Self-Link
Access Patterns for Robots and Humans in Web Archives 35Who and What Links to the Internet Archive
The Existence of the Archived
Pages on the Live Web
Humans Robots
URI-Rs available on live web 36.4% 62.5%
URI-Rs missing from live web 63.6% 37.5%
Humans come to the archive
because they can’t find web
pages on the live web
Access Patterns for Robots and Humans in Web Archives 36Who and What Links to the Internet Archive
The Existence of Unarchived Pages
on the Live Web
Humans Robots
URI-Rs available on live web 25.4% 33.2%
URI-Rs missing from live web 74.6% 66.8%
Access Patterns for Robots and Humans in Web Archives 37Who and What Links to the Internet Archive
Existence in Other Web Archives
Web Archive #URI-R #URI-M
Internet Archive (2013) 56,503 1,657,264
The National Archives 787 15,354
ArchiefWeb 47 18,347
Archive-It 41 4,682
UK Web Archive 38 12,277
Library of Congress 35 1,092
WebCite 29 1,104
The number of the requested pages
not in the archive is 211,825 (2012)
Access Patterns for Robots and Humans in Web Archives 38Who and What Links to the Internet Archive
82% of Human Sessions Have Referring URIS
WebSite Percentage Description
en.wikipedia.org 12.9% Wikipedia
archive.org 11.9% IA Home Page
reddit.com 10.2% Social News Web Site
google.TLD 9.9% Search Engine
info-poland.buffalo.edu 1.5% Polish Studies
de.wikipedia.org 1.4% Wikipedia
cracked.com 1.2% Humor Site
snopes.com 1.1% Urban Legends Reference Pages
facebook.com 0.9% Social Media
crochetpatterncentral.com 0.9% Crocheting Hobbies
Access Patterns for Robots and Humans in Web Archives 39Who and What Links to the Internet Archive
Many European Domains Link to IA
ccTLD .com .uk .de .ca .jp .pl .nl .ru .fr .br
Percentage 56.7% 6.0% 5.3% 4.8% 3.7% 2.2% 1.9% 1.7% 1.5% 1.4%
TLD .com .org .net .jp .ru .de .edu .to .uk .info
Percentage 45.4% 33.9% 8.4% 1.8% 1.4% 1.4% 1.1% 0.7% 0.6% 0.5%
The top 10 TLDs of the referrers.
The top 10 ccTLDs of Google search referrers.
Access Patterns for Robots and Humans in Web Archives 40Who and What Links to the Internet Archive
Most of the Links (86%) Are to Mementos
Access Patterns for Robots and Humans in Web Archives 41Who and What Links to the Internet Archive
Significant Bias for the Recent Past
Access Patterns for Robots and Humans in Web Archives 42Who and What Links to the Internet Archive
For 83% of Externally Linked Mementos,
Corresponding Original URI is 404 on Live Web
Access Patterns for Robots and Humans in Web Archives 43Who and What Links to the Internet Archive
Conclusions
• English is the most common language, followed
by many European languages, and Japanese &
Vietnamese
• Languages self-link (and link to English)
• 82% of human sessions have referrals
• 86% of the referring web pages link deeply to
mementos
• 83% of the links to these mementos are because
their corresponding URI-Rs do not exist on the
live web

More Related Content

What's hot

Capture All the URLS: First Steps in Web Archiving
Capture All the URLS: First Steps in Web ArchivingCapture All the URLS: First Steps in Web Archiving
Capture All the URLS: First Steps in Web ArchivingKristen Yarmey
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Case Study: UNC Eshelman School of Pharmacy Web Site
Case Study: UNC Eshelman School of Pharmacy Web SiteCase Study: UNC Eshelman School of Pharmacy Web Site
Case Study: UNC Eshelman School of Pharmacy Web Sitejzheel
 
Link Persistence, Website Persistence
Link Persistence, Website PersistenceLink Persistence, Website Persistence
Link Persistence, Website Persistencenullhandle
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesMichael Nelson
 
The Next Big Thing is Web 3.0. Catch It If You Can
The Next Big Thing is Web 3.0. Catch It If You Can The Next Big Thing is Web 3.0. Catch It If You Can
The Next Big Thing is Web 3.0. Catch It If You Can Judy O'Connell
 
Recommending Archived Webpages Using Only The URI
Recommending Archived Webpages Using Only The URIRecommending Archived Webpages Using Only The URI
Recommending Archived Webpages Using Only The URILulwahMA
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesYasmin AlNoamany, PhD
 
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...Sarah Whitcher Kansa
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesYasmin AlNoamany, PhD
 
An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)Cliff Landis
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
The Internet and Self-directed Learning
The Internet and Self-directed LearningThe Internet and Self-directed Learning
The Internet and Self-directed LearningGeorge Roberts
 
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...Pavlinka Kovatcheva
 
Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)Olaf Hartig
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple ArchivesMichael Nelson
 

What's hot (17)

Capture All the URLS: First Steps in Web Archiving
Capture All the URLS: First Steps in Web ArchivingCapture All the URLS: First Steps in Web Archiving
Capture All the URLS: First Steps in Web Archiving
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Case Study: UNC Eshelman School of Pharmacy Web Site
Case Study: UNC Eshelman School of Pharmacy Web SiteCase Study: UNC Eshelman School of Pharmacy Web Site
Case Study: UNC Eshelman School of Pharmacy Web Site
 
Link Persistence, Website Persistence
Link Persistence, Website PersistenceLink Persistence, Website Persistence
Link Persistence, Website Persistence
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
The Next Big Thing is Web 3.0. Catch It If You Can
The Next Big Thing is Web 3.0. Catch It If You Can The Next Big Thing is Web 3.0. Catch It If You Can
The Next Big Thing is Web 3.0. Catch It If You Can
 
Recommending Archived Webpages Using Only The URI
Recommending Archived Webpages Using Only The URIRecommending Archived Webpages Using Only The URI
Recommending Archived Webpages Using Only The URI
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web Archives
 
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ...
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web Archives
 
An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
The Internet and Self-directed Learning
The Internet and Self-directed LearningThe Internet and Self-directed Learning
The Internet and Self-directed Learning
 
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...
From Web 2.0 to Web 3.0: Yesterday, Today, Tomorrow. Where the Technology is ...
 
Hartman Summit 09
Hartman Summit 09Hartman Summit 09
Hartman Summit 09
 
Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 

Viewers also liked

Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesMichael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Michael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Michael Nelson
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research ObjectYasmin AlNoamany, PhD
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web ArchivesMichael Nelson
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageMichael Nelson
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?Michael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange ProjectMichael Nelson
 

Viewers also liked (16)

Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
 

Similar to Who and What Links to the Internet Archive

Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's pptmak57
 
OpenSocial and Mixi platform
OpenSocial and Mixi platformOpenSocial and Mixi platform
OpenSocial and Mixi platformPham Thinh
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesYasmin AlNoamany, PhD
 
Preprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningPreprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningAmir Masoud Sefidian
 
WebQuilt: Capturing and Visualizing the Web Experience at WWW10
WebQuilt: Capturing and Visualizing the Web Experience at WWW10WebQuilt: Capturing and Visualizing the Web Experience at WWW10
WebQuilt: Capturing and Visualizing the Web Experience at WWW10Jason Hong
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchRafał Kuć
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchSematext Group, Inc.
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicy
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicyBrowsers_SameOriginPolicy_CORS_ContentSecurityPolicy
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicysubbul
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacksFrank Victory
 
Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Daniel Toomey
 
Real-time data analysis using ELK
Real-time data analysis using ELKReal-time data analysis using ELK
Real-time data analysis using ELKJettro Coenradie
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Top 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersTop 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersBrian Huff
 
HTTP cookie hijacking in the wild: security and privacy implications
HTTP cookie hijacking in the wild: security and privacy implicationsHTTP cookie hijacking in the wild: security and privacy implications
HTTP cookie hijacking in the wild: security and privacy implicationsPriyanka Aash
 
Web sockets in java EE 7 - JavaOne 2013
Web sockets in java EE 7 - JavaOne 2013Web sockets in java EE 7 - JavaOne 2013
Web sockets in java EE 7 - JavaOne 2013Siva Arunachalam
 
An Introduction To IT Security And Privacy - Servers And More
An Introduction To IT Security And Privacy - Servers And MoreAn Introduction To IT Security And Privacy - Servers And More
An Introduction To IT Security And Privacy - Servers And MoreBlake Carver
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreAndy Powell
 

Similar to Who and What Links to the Internet Archive (20)

Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's ppt
 
OpenSocial and Mixi platform
OpenSocial and Mixi platformOpenSocial and Mixi platform
OpenSocial and Mixi platform
 
final ppt.pptx
final ppt.pptxfinal ppt.pptx
final ppt.pptx
 
final ppt.pptx
final ppt.pptxfinal ppt.pptx
final ppt.pptx
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web Archives
 
Preprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningPreprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage Mining
 
WebQuilt: Capturing and Visualizing the Web Experience at WWW10
WebQuilt: Capturing and Visualizing the Web Experience at WWW10WebQuilt: Capturing and Visualizing the Web Experience at WWW10
WebQuilt: Capturing and Visualizing the Web Experience at WWW10
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicy
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicyBrowsers_SameOriginPolicy_CORS_ContentSecurityPolicy
Browsers_SameOriginPolicy_CORS_ContentSecurityPolicy
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacks
 
Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010
 
Real-time data analysis using ELK
Real-time data analysis using ELKReal-time data analysis using ELK
Real-time data analysis using ELK
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Top 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersTop 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud Developers
 
HTTP cookie hijacking in the wild: security and privacy implications
HTTP cookie hijacking in the wild: security and privacy implicationsHTTP cookie hijacking in the wild: security and privacy implications
HTTP cookie hijacking in the wild: security and privacy implications
 
Web sockets in java EE 7 - JavaOne 2013
Web sockets in java EE 7 - JavaOne 2013Web sockets in java EE 7 - JavaOne 2013
Web sockets in java EE 7 - JavaOne 2013
 
An Introduction To IT Security And Privacy - Servers And More
An Introduction To IT Security And Privacy - Servers And MoreAn Introduction To IT Security And Privacy - Servers And More
An Introduction To IT Security And Privacy - Servers And More
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
 

More from Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 

More from Michael Nelson (9)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Who and What Links to the Internet Archive

  • 1. Access Patterns for Robots and Humans in Web ArchivesWho and What Links to the Internet Archive Who and What Links to the Internet Archive Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, Michael L. Nelson Computer Science Department Old Dominion University, Norfolk, VA mln@cs.odu.edu
  • 2. Access Patterns for Robots and Humans in Web Archives 2Who and What Links to the Internet Archive Motivation • What do web archive users look for and where do they come from? EnglishRussianJapanese English English English English Spanish Russian Russian Russian Spanish Spanish German German German Japanese Japanese Japanese
  • 3. Access Patterns for Robots and Humans in Web Archives 3Who and What Links to the Internet Archive Methodology
  • 4. Access Patterns for Robots and Humans in Web Archives 4Who and What Links to the Internet Archive Data Set • Six million recordsfrom Internet Archive’s Wayback Machine web server logs of February 2, 2012 • Data set statistics Get Embedded Resources Null Referrers 2xx 3xx 4xx 5xx Humans Robots 99% 43% 47% 33% 51% 12% 4% 1.5% 18.8% The percentage of humans and robots remaining after cleaning
  • 5. Access Patterns for Robots and Humans in Web Archives 5Who and What Links to the Internet Archive Sample from 6pm-midnight (prime Internet hours)
  • 6. Access Patterns for Robots and Humans in Web Archives 6Who and What Links to the Internet Archive Wayback Machine Access Logs 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433 • Referring URI: http://www.archive.org/web/web.php • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
  • 7. Access Patterns for Robots and Humans in Web Archives 7Who and What Links to the Internet Archive Wayback Machine Access Logs 0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7" • Client IP: 0.247.222.86 • Access time: 02/Feb/2012:07:03:46 +0000 • HTTP request method: GET • URI: http://wayback.archive.org/web/*/http://www.cnn.com • Protocol: HTTP/1.1 • HTTP status code: 200 • Bytes sent: 96433 • Referring URI: http://www.archive.org/web/web.php • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7 IPs anonymized by Internet Archive
  • 8. Access Patterns for Robots and Humans in Web Archives 8Who and What Links to the Internet Archive Pre-Processing • Data Cleaning • Session Identification • Robot Detection AlNoamany 2013 8
  • 9. Access Patterns for Robots and Humans in Web Archives 9Who and What Links to the Internet Archive Data Cleaning 9 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20070519015308/ http://www.jcdl.org/
  • 10. Access Patterns for Robots and Humans in Web Archives 10Who and What Links to the Internet Archive Embedded Resources 10 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 11. Access Patterns for Robots and Humans in Web Archives 11Who and What Links to the Internet Archive Embedded Resources 11 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 12. Access Patterns for Robots and Humans in Web Archives 12Who and What Links to the Internet Archive Static Resources 12 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 13. Access Patterns for Robots and Humans in Web Archives 13Who and What Links to the Internet Archive Static Resources 13 http://web.archive.org/web/20070519015308/ http://www.jcdl.org/ 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 14. Access Patterns for Robots and Humans in Web Archives 14Who and What Links to the Internet Archive Invalid Requests 14 http://web.archive.org/web/20100102003557/ about:blank 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 15. Access Patterns for Robots and Humans in Web Archives 15Who and What Links to the Internet Archive Invalid Requests 15 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20100102003557/ about:blank
  • 16. Access Patterns for Robots and Humans in Web Archives 16Who and What Links to the Internet Archive Requests that had 3xx Status Code 16 http://web.archive.org/web/20130114160045/ http://www.jcdl.org/0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"
  • 17. Access Patterns for Robots and Humans in Web Archives 17Who and What Links to the Internet Archive Requests that had 3xx Status Code 17 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/ curl -I "http://web.archive.org/web/20140004100000/http://www.jcdl.org/" HTTP/1.1 302 Moved Temporarily Server: Tengine/1.4.3 Date: Tue, 02 Jul 2013 19:48:59 GMT Content-Type: application/octet-stream Content-Length: 0 Connection: keep-alive set-cookie: wayback_server=10; Domain=archive.org; Path=/; Expires=Thu, 01-Aug-13 19:48:59 GMT; Location: /web/20130114160045/http://www.jcdl.org/
  • 18. Access Patterns for Robots and Humans in Web Archives 18Who and What Links to the Internet Archive Requests that had 3xx Status Code 18 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308/http ://www.jcdl.org/HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20070519015308im_/h ttp://www.jcdl.org/images/jcdl2007-edie.jpg HTTP/1.1" 200 2137 "-" "Mozilla/5.0" 0.11.160.135 [02/Feb/2012:00:01:03] "GET http://staticweb.archive.org/images/toolbar/wa yback-toolbar-logo.png HTTP/1.1" 200 3700 "–" "Mozilla/5.0" 0.151.147.108 [02/Feb/2012:00:01:03] "GET http://web.archive.org/web/20100102003557/abou t:blank HTTP/1.1" 302 0 "www.xx.com" "Mozilla/4.0" 0.26.129.146 - - [02/Feb/2012:00:01:54] "GET http://web.archive.org/web/20140004100000/http ://www.jcdl.org/ HTTP/1.1" 302 0 "-" "Mozilla/5.0" http://web.archive.org/web/20130114160045/ http://www.jcdl.org/
  • 19. Access Patterns for Robots and Humans in Web Archives 19Who and What Links to the Internet Archive Session: Set of Web Pages Requested by a Particular User 19 1 mins 4 mins 3 mins 9 mins p1 p2 p3 p4 p5
  • 20. Access Patterns for Robots and Humans in Web Archives 20Who and What Links to the Internet Archive Session: Set of Web Pages Requested by a Particular User 20 1 mins 4 mins 3 mins 9 mins p1 p2 p3 p4 p5 Time between two requests ≤ 10 mins
  • 21. Access Patterns for Robots and Humans in Web Archives 21Who and What Links to the Internet Archive Session Identification • Threshold timeout: 10 minutes Liu et al. 2007, Spiliopoulou et al. 2003 • Grouping: based on the IP and User- Agent 21
  • 22. Access Patterns for Robots and Humans in Web Archives 22Who and What Links to the Internet Archive Robot Detection is a Big Challenge 22 I’m not a robot
  • 23. Access Patterns for Robots and Humans in Web Archives 23Who and What Links to the Internet Archive User-Agent Check 0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET http://wayback.archive.org/web/199906 01000000*/http://www.belizefirst.com/ HTTP/1.0" 200 98507 "-" "Python-urllib/1.17" 23
  • 24. Access Patterns for Robots and Humans in Web Archives 24Who and What Links to the Internet Archive Number of User-Agents per IP 24
  • 25. Access Patterns for Robots and Humans in Web Archives 25Who and What Links to the Internet Archive Number of User-Agents per IP 25 One IP with User-Agent ≥20 = lying Robot
  • 26. Access Patterns for Robots and Humans in Web Archives 26Who and What Links to the Internet Archive Robots.txt File • Session that contains an access for robots.txt is a robot 26 0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)" 0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET http://wayback.archive.org/web/*/http://www.devilscafe.in HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)" 0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET http://wayback.archive.org/web/*/http://www.genie.co.il HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
  • 27. Access Patterns for Robots and Humans in Web Archives 27Who and What Links to the Internet Archive 6 Requests, 2 Seconds  Robot 27 0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 “-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 “-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.182.141.149 - - [02/Feb/2012:07:00:3 +0000] "GET http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 “-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 28. Access Patterns for Robots and Humans in Web Archives 28Who and What Links to the Internet Archive 3 Requests, 520 Seconds (9 Minutes)  Human 28 0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200 566433 " http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) 0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433 " http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
  • 29. Access Patterns for Robots and Humans in Web Archives 29Who and What Links to the Internet Archive Image-to-HTML Ratio 29 If I download these, I’m not a robot
  • 30. Access Patterns for Robots and Humans in Web Archives 30Who and What Links to the Internet Archive Image-to-HTML Ratio • The ratio between the number of image files and the number of HTML files per session • Robots sessions are less than 1:10 image to HTML ratio(Stassopoulou et al. 2005) 30
  • 31. Access Patterns for Robots and Humans in Web Archives 31Who and What Links to the Internet Archive finish • Who link to the archive? • How do people reach web archives? • Why they link to the Archive? • Deep links? Found in Archive YesNo Check the language of the archived page Found on Live Web Check the Status Code YesNo Check the language Check status code on the live web Check existence in other archives
  • 32. Access Patterns for Robots and Humans in Web Archives 32Who and What Links to the Internet Archive Languages for Pages in the Archive
  • 33. Access Patterns for Robots and Humans in Web Archives 33Who and What Links to the Internet Archive Languages for Requested Pages NOT in the Archive
  • 34. Access Patterns for Robots and Humans in Web Archives 34Who and What Links to the Internet Archive Most Languages Self-Link
  • 35. Access Patterns for Robots and Humans in Web Archives 35Who and What Links to the Internet Archive The Existence of the Archived Pages on the Live Web Humans Robots URI-Rs available on live web 36.4% 62.5% URI-Rs missing from live web 63.6% 37.5% Humans come to the archive because they can’t find web pages on the live web
  • 36. Access Patterns for Robots and Humans in Web Archives 36Who and What Links to the Internet Archive The Existence of Unarchived Pages on the Live Web Humans Robots URI-Rs available on live web 25.4% 33.2% URI-Rs missing from live web 74.6% 66.8%
  • 37. Access Patterns for Robots and Humans in Web Archives 37Who and What Links to the Internet Archive Existence in Other Web Archives Web Archive #URI-R #URI-M Internet Archive (2013) 56,503 1,657,264 The National Archives 787 15,354 ArchiefWeb 47 18,347 Archive-It 41 4,682 UK Web Archive 38 12,277 Library of Congress 35 1,092 WebCite 29 1,104 The number of the requested pages not in the archive is 211,825 (2012)
  • 38. Access Patterns for Robots and Humans in Web Archives 38Who and What Links to the Internet Archive 82% of Human Sessions Have Referring URIS WebSite Percentage Description en.wikipedia.org 12.9% Wikipedia archive.org 11.9% IA Home Page reddit.com 10.2% Social News Web Site google.TLD 9.9% Search Engine info-poland.buffalo.edu 1.5% Polish Studies de.wikipedia.org 1.4% Wikipedia cracked.com 1.2% Humor Site snopes.com 1.1% Urban Legends Reference Pages facebook.com 0.9% Social Media crochetpatterncentral.com 0.9% Crocheting Hobbies
  • 39. Access Patterns for Robots and Humans in Web Archives 39Who and What Links to the Internet Archive Many European Domains Link to IA ccTLD .com .uk .de .ca .jp .pl .nl .ru .fr .br Percentage 56.7% 6.0% 5.3% 4.8% 3.7% 2.2% 1.9% 1.7% 1.5% 1.4% TLD .com .org .net .jp .ru .de .edu .to .uk .info Percentage 45.4% 33.9% 8.4% 1.8% 1.4% 1.4% 1.1% 0.7% 0.6% 0.5% The top 10 TLDs of the referrers. The top 10 ccTLDs of Google search referrers.
  • 40. Access Patterns for Robots and Humans in Web Archives 40Who and What Links to the Internet Archive Most of the Links (86%) Are to Mementos
  • 41. Access Patterns for Robots and Humans in Web Archives 41Who and What Links to the Internet Archive Significant Bias for the Recent Past
  • 42. Access Patterns for Robots and Humans in Web Archives 42Who and What Links to the Internet Archive For 83% of Externally Linked Mementos, Corresponding Original URI is 404 on Live Web
  • 43. Access Patterns for Robots and Humans in Web Archives 43Who and What Links to the Internet Archive Conclusions • English is the most common language, followed by many European languages, and Japanese & Vietnamese • Languages self-link (and link to English) • 82% of human sessions have referrals • 86% of the referring web pages link deeply to mementos • 83% of the links to these mementos are because their corresponding URI-Rs do not exist on the live web

Editor's Notes

  1. no previous work has been carried out to answer these questions: What content languages are web archive users looking for? Why do users come to web archives? Where do web archive users come from? Who links to web archives? How do sites link to web archives? Do sites link deeply to specific archived pages or link to the repository? Why do sites link to the past?
  2. We followed log pre-processing steps that existed and tested long time ago, except one thing, which will be mentioned later
  3. Data cleaning is the removing the log entries that were not irrelevant to the analysisNormally, in this step the self-identified robots are removed, but because the robots of web archives come intentionally ….
  4. Requests that were generated automatically by the webbrowser for embedded resources of the requested web
  5. • Static resources of the Internet Archive web site and the
  6. Some web sites have links for malformed URI-Rs among their embedded resources, so that each request on theirweb sites caused automatic requests to the Wayback Machine server.
  7. We also removed redirections, because the wayback machine behaviour when the user request archived version in specific time which is not archived. It responses with redirection to the nearest archived version For example if you request jcdl home page in 2014, it will respond with the last archived copy of this site
  8. Embedded resourcesEntries with an HTTP status code other than HTTP 200, 404, or 503HEAD requestStatic resourcesInvalid requests0.151.147.108 - - [02/Feb/2012:00:01:03 +0000] "GET http://web.archive.org/web/20100102003557/about:blank HTTP/1.1" 302 0 "http://www.fastpasstv.ms/tv/justified/season-3/episode-3/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
  9. First, we group all the requests based on the IP and User-Agent (as described in Section 4.4). Second, we apply a threshold timeout, so that if the time elapsed between two consecutive requests is longer than this threshold, the second request is considered to be the first request of the new session.
  10. So that the time elapsed between two consecutive requests is longer than this threshold
  11. First, we group all the requests based on the IP and User-Agent (as described in Section 4.4). Second, we apply a threshold timeout, so that if the time elapsed between two consecutive requests is longer than this threshold, the second request is considered to be the first request of the new session.
  12. Because many robots lie about their identity, robot detection is a big challenge in web usage mining
  13. This is the simplest rule, we check the UA field for the self-identified robots
  14. When we were doing session identification, while we were grouping the session based on the IP and the user agent fields, we found that there are lying robots who change their user agent every request.Some robots change their User-Agent every request
  15. Some robots change their User-Agent eUsers with more than 20 different User-Agents were classified as robots.very request
  16. We also removed the sessions which has download for robot.txtThe next rule is browsing speed
  17. Human sessions should have more images than robot sessionsbecause of the embedded images present in most HTMLpages. Robots tend to retrieve only HTML pages, while ignoringimage formats.with less than one image file for every 10 HTML files as a robot
  18. We labeled the sessions that have less than 1:10 IM-to-HTML ratio as a robot
  19. What users are looking for Users request English pages the most, followed by the European languages
  20. The large percentage of European languages among the unarchived pagescan be a good indicator for archival demand for European web pages.
  21. The notable exceptions of Japanese  {Bengali, Vietnamese} and German Portuguese.
  22. humans comes to web archives because they don’t find web pages on the live web.
  23. How people How do people reach web archives?Who links to web archives?
  24. Most of the links (86%) from websites are to individual archived pages at specic points in time,and of those 83% no longer exist on the live web
  25. Why sites link to web archives
  26. We provided analysis for the distributions of languages to gain insight about what users look for