EVALUATING SLIDING AND
STICKY TARGET POLICIES
BY MEASURING TEMPORAL
DRIFT IN ACYCLIC WALKS
THROUGH A WEB ARCHIVE
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
JCDL 2013
JULY 23-25, 2013
INDIANAPOLIS, INDIANA USA
Joint Conference on Digital Libraries (JCDL) 2013
A FABLE FROM WAYBACK
7/23/13 Scott G. Ainsworth • Michael L. Nelson
A long, long time ago…
ODU Computer Science
updated its web site…
What did it look like?
May 2005...
WHAT JUST HAPPENED?
WHAT WE EXPECTED
2005-05-14 @ 01:36:08
WHAT WE GOT
2005-03-31 @ 09:16:10
SLIDING TARGET
2005-05-14
01:36:08
2005-04-22
00:17:52
2005-03-31
09:16:10
STICKY TARGET
What if the target
is held steady?
(Enabled by Memento API)
MEMENTO HTTP EXTENSION*
Adds ability to request a particular date-time
Enables Sticky Target
Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…
Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…
*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
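The header exchange above can be sketched in a few lines of Python. This is only an illustration that uses the standard library to format and parse the two datetime headers; the helper names are made up for the example, and no request is actually sent to a TimeGate.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Build the request header that asks for a particular datetime.

    Hypothetical helper: `target` must be timezone-aware and in UTC so that
    usegmt=True can render the HTTP date form ending in "GMT".
    """
    return {"Accept-Datetime": format_datetime(target, usegmt=True)}


def memento_datetime(response_headers: dict) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])


# Datetime requested on the slide.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)

# Response header copied from the slide (not fetched from an archive).
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
received = memento_datetime(response_headers)

print(request_headers["Accept-Datetime"])
print(received.isoformat())
```

Keeping `target` unchanged for every later request is what makes the target sticky.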
STICKY TARGET
MementoFox Extension
2005-05-14
2005-05-14
01:36:08
2005-04-22
00:17:52
2005-05-14
01:36:08
DRIFT COMPARISON
                 Sliding                                        Sticky
Page             Datetime              Drift                    Datetime              Drift
CS Home          2005-05-14 01:36:08   –                        2005-05-14 01:36:08   –
Science Home     2005-04-22 00:17:52   22.1 days                2005-04-22 00:17:52   22.1 days
CS Home          2005-03-31 09:16:10   43.7 days (+21.6 days)   2005-05-14 01:36:08   –
Mean                                   32.9 days                                      11.0 days
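The drift figures in this comparison can be reproduced with a short Python sketch. It assumes drift is the absolute difference between the original target (2005-05-14 01:36:08) and each memento datetime, and that the mean is taken over the two steps after the starting page; the second assumption is inferred from the numbers rather than stated on the slide.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400.0


def drift_days(target: datetime, memento: datetime) -> float:
    """Absolute difference between target and memento datetimes, in days."""
    return abs((memento - target).total_seconds()) / SECONDS_PER_DAY


target = datetime(2005, 5, 14, 1, 36, 8)

# Memento datetimes received at each step, copied from the comparison table.
sliding = [
    datetime(2005, 5, 14, 1, 36, 8),   # CS Home
    datetime(2005, 4, 22, 0, 17, 52),  # Science Home
    datetime(2005, 3, 31, 9, 16, 10),  # CS Home again
]
sticky = [
    datetime(2005, 5, 14, 1, 36, 8),
    datetime(2005, 4, 22, 0, 17, 52),
    datetime(2005, 5, 14, 1, 36, 8),
]

sliding_drift = [drift_days(target, m) for m in sliding]
sticky_drift = [drift_days(target, m) for m in sticky]

# Assumption: the mean excludes the starting page, which has no drift.
sliding_mean = sum(sliding_drift[1:]) / len(sliding_drift[1:])
sticky_mean = sum(sticky_drift[1:]) / len(sticky_drift[1:])

print([round(d, 1) for d in sliding_drift], round(sliding_mean, 1))
print([round(d, 1) for d in sticky_drift], round(sticky_mean, 1))
```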
QUESTIONS
How much temporal drift is there with the two
policies?
Does the sticky policy reduce drift as expected?
If so, by how much?
How do
• Choice (number of links)
• Domains visited
• Walk length
Influence drift?
CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
• Future work
RELATED WORK
Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and
depth
• Ben Saad et al. – metadata from crawl used to
select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
• Future work & conclusions
DEFINITIONS
Walk Length       Number of successful steps (HTTP 200 response)
Unique Domains    Number of unique domains (jcdl.org, amazon.com, etc.)
Choice            Number of unique links (calculated per page)
Drift             | target-datetime_1 – Memento-Datetime_i |
PROCESS BY EXAMPLE
Select a URI
• Random selection of 1 out of 4,000
4,000 sample URIs – same as JCDL 2011 paper
• DMOZ – a reference
• Search Engines – best random sampling
• Bitly – does shortening have an impact?
• Delicious – does popularity have an impact?
“How Much of the Web Is Archived?”
http://arxiv.org/abs/1212.6177
PROCESS BY EXAMPLE
First, select a URI
• Random selection of 1 out of 4,000
Second, download timemap
<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 07 May 2005 09:37:40 GMT",
<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",
<http://api.wayback.archive.org/memento/20050515002903/http://www.cs.odu.edu/>; rel="memento"; datetime="Sun, 15 May 2005 00:29:03 GMT",
Selected memento:
<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",
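A timemap in this form can be read with a small parser. The sketch below is a simplification that assumes every entry has exactly the attribute order shown above (URI, then rel, then datetime); real timemaps may carry other relation types and attributes. The function names and the target datetime are invented for the example.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed entry shape: <uri>; rel="..."; datetime="..." (whitespace may vary).
ENTRY = re.compile(r'<([^>]+)>;\s*rel="([^"]*)";\s*datetime="([^"]+)"')


def parse_timemap(text: str) -> list:
    """Return (datetime, memento URI) pairs for entries whose rel includes memento."""
    mementos = []
    for uri, rel, stamp in ENTRY.findall(text):
        if "memento" in rel.split():
            mementos.append((parsedate_to_datetime(stamp), uri))
    return mementos


def nearest(mementos: list, target: datetime) -> tuple:
    """Pick the memento whose datetime is closest to the target datetime."""
    return min(mementos, key=lambda entry: abs(entry[0] - target))


timemap = """
<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>;
rel="memento";
datetime="Sat, 07 May 2005 09:37:40 GMT",
<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>;
rel="memento";
datetime="Sat, 14 May 2005 01:36:08 GMT",
<http://api.wayback.archive.org/memento/20050515002903/http://www.cs.odu.edu/>;
rel="memento";
datetime="Sun, 15 May 2005 00:29:03 GMT",
"""

mementos = parse_timemap(timemap)
# Hypothetical target datetime, chosen only to exercise the selection.
when, uri = nearest(mementos, datetime(2005, 5, 14, tzinfo=timezone.utc))
print(when.isoformat(), uri)
```

For the first step of a walk the memento is chosen at random instead, which could be done with `random.choice(mementos)`.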
PROCESS BY EXAMPLE
Next, download both mementos
Wayback Machine Memento API
PROCESS BY EXAMPLE
Next, download both mementos
and find common links
Wayback Machine Memento API
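One way to find the links two mementos have in common is sketched below. It assumes both archives rewrite links into the shape `<archive prefix>/<14-digit timestamp>/<original URI>`, so links are compared by original URI; the class and function names are illustrative, not the tooling used in the study.

```python
import re
from html.parser import HTMLParser

# Assumed archive URI shape: anything, then /<14 digits>/, then the original URI.
ARCHIVE_PREFIX = re.compile(r"^.*?/\d{14}/(?=https?://)")


class LinkCollector(HTMLParser):
    """Collect the original-URI form of every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Strip the archive prefix so both sides are comparable.
                self.links.add(ARCHIVE_PREFIX.sub("", href))


def original_links(html: str) -> set:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def common_links(wayback_html: str, api_html: str) -> set:
    """Links present in both mementos, keyed by original URI."""
    return original_links(wayback_html) & original_links(api_html)


# Tiny hand-written pages standing in for the two downloaded mementos.
wayback_page = (
    '<a href="http://web.archive.org/web/20050514013608/http://sci.odu.edu/">Sciences</a>'
    '<a href="http://web.archive.org/web/20050514013608/http://www.odu.edu/">ODU</a>'
)
api_page = (
    '<a href="http://api.wayback.archive.org/memento/20050514013608/http://sci.odu.edu/">Sciences</a>'
    '<a href="http://api.wayback.archive.org/memento/20050514013608/http://www.example.org/">Other</a>'
)

shared = common_links(wayback_page, api_page)
print(shared)
```

An empty result corresponds to the "No Common Links" stop cause.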
STATUS SO FAR
Successful Steps 1
Unique Domains 1
Choice 48
Mean Drift (days) 0.0 WB 0.0 API
PROCESS BY EXAMPLE
Find common links
and select one for the next step
Wayback Machine Memento API
PROCESS BY EXAMPLE
The timemap is downloaded, the best datetimes are
selected, and the mementos are downloaded…
Wayback Machine Memento API
Successful Steps 1 + 1 = 2
Unique Domains 1 + 0 = 1
Choice 48 + 36 = 84
Mean Drift (days) 11.0 WB 11.0 API
PROCESS BY EXAMPLE
Again for http://www.odu.edu
Wayback Machine Memento API
Successful Steps 2 + 1 = 3
Unique Domains 1 + 0 = 1
Choice 84 + 33 = 117
Mean Drift (days) 14.7 WB 7.4 API
PROCESS BY EXAMPLE
And for http://www.odusports.com
Redirected at acquisition time
HTTP Response:
• 302 Redirect
• Location header
Wayback Machine Memento API
PROCESS BY EXAMPLE
And for http://odusports.collegesports.com
Wayback Machine Memento API
Successful Steps 3 + 1 = 4
Unique Domains 1 + 1 = 2
Choice 117 + 77 = 194
Mean Drift (days) 18.2 WB 7.3 API
PROCESS BY EXAMPLE
And for http://www.vtext.com
Wayback Machine Memento API
Successful Steps 4 + 1 = 5
Unique Domains 2 + 1 = 3
Choice 194 + 14 = 208
Mean Drift (days) 20.3 WB 5.8 API
PROCESS BY EXAMPLE
And 404 stops the walk
Wayback Machine Memento API
HTTP Response:
• 404 Not Found
Successful Steps 4 + 1 = 5
Unique Domains 2 + 1 = 3
Choice 194 + 14 = 208
Mean Drift (days) 20.3 WB 5.8 API
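The running totals in this example walk can be tallied as sketched below. The per-step link counts are the ones shown on the slides. The domain rule (the last two host labels) is a simplifying assumption for illustration, and `WalkStats` is a hypothetical name rather than the study's code.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def registered_domain(uri: str) -> str:
    """Naive domain extraction: keep the last two host labels.

    Assumption for illustration only; suffixes such as co.uk would need a
    public suffix list.
    """
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


@dataclass
class WalkStats:
    steps: int = 0                              # successful steps (walk length)
    choice: int = 0                             # sum of unique links per step
    domains: set = field(default_factory=set)   # unique domains visited

    def add_step(self, uri: str, unique_links: int) -> None:
        self.steps += 1
        self.choice += unique_links
        self.domains.add(registered_domain(uri))


# Successful steps of the example walk with their unique-link counts.
example_walk = [
    ("http://www.cs.odu.edu/", 48),
    ("http://sci.odu.edu/", 36),
    ("http://www.odu.edu", 33),
    ("http://odusports.collegesports.com", 77),
    ("http://www.vtext.com", 14),
]

walk = WalkStats()
for uri, links in example_walk:
    walk.add_step(uri, links)

print(walk.steps, len(walk.domains), walk.choice)
```

The final 404 adds nothing to the totals because only successful steps are counted.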
STOP CAUSES
                     First Step           Subsequent Steps
Stop Cause           Count    Percent     Count     Percent
Timemaps
  HTTP 403              74      1.7%       4,803      9.1%
  HTTP 404           1,327     30.1%      15,850     29.9%
  HTTP 503               0      0.0%          43      0.1%
  Other                  2      0.0%         180      0.3%
Mementos
  HTTP 403              52      1.2%         476      0.9%
  HTTP 404             215      4.9%       3,633      6.8%
  HTTP 503           1,957     44.4%      10,535     19.9%
  Download failed      154      3.5%         589      1.1%
  Not HTML             514     11.7%       2,856      5.4%
  No Common Links        0      0.0%      12,957     24.4%
  Other                117      2.7%       1,128      2.1%
Totals               4,412                53,050
WALKS AND STEPS
Status Total
Walks Attempted 200,000
Unique Walks 53,100
Successful Walks 48,685
Pct. Successful 91.7%
Steps 240,439
Successful Steps 187,371
w/drift > 1yr 6,701
w/drift > 5yrs 111
Successful Steps/Walk 3.8
WALK LENGTHS
[Figure: occurrences (log scale, 1 to 10,000) by walk length (1 to 50)]
MEDIAN DRIFT BY STEP
[Figure: median drift (months, 0 to 3) by step number (1 to 50); series: Sliding (UI) and Sticky (API)]
DRIFT BY STEP
[Figure: two panels, Sliding Policy (UI) and Sticky Policy (API); drift (years, 1 to 10) by step number; density bands from at least 1 memento to at least 32,768 mementos]
DRIFT BY CHOICE
[Figure: mean drift (months) by choice; series: Sliding and Sticky]
DRIFT BY DOMAINS
[Figure: mean drift (months) by domain count; series: Sliding and Sticky]
FUTURE WORK
Integrate real-world walk patterns
• AlNoamany et al. – Internet Archive logs
• Domains users avoid – link farms, etc.
• Domain clusters
• Self referencing domains – 101celebrities.com
Check other archives
• Other archives now have Memento API
CONCLUSIONS
30 days less drift using Sticky policy.
Sticky policy controls drift;
Sliding policy does not.
BACKUP
WALK LENGTHS
Walk Length DMOZ S.Eng. Delicious Bitly Total
1 5,355 1,239 7,139 1,289 15,076
2 3,571 924 4,857 817 10,169
3 1,891 598 3,311 623 6,423
4 1,212 381 2,228 415 4,236
5 791 315 1,588 314 3,008
6 583 232 1,168 259 2,242
7 417 178 877 186 1,658
8 258 153 651 136 1,198
9 187 111 498 108 904
10 144 79 337 79 679
…
20 14 10 36 9 76
…
41-45 6 2 14 2 24
46-50 6 3 6 1 16
MEAN DRIFT BY STEP
[Figure: mean drift (months, 0 to 7) by step number (1 to 50); series: Sliding (UI) and Sticky (API); markers: μ (mean) and σ (standard deviation)]
SLIDING TARGET
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
⟹ GET …/20050514013608/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 Found
Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 22 Days
⟹ GET …/20050422001752/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 Found
Location: …/20050331091610/http://www.cs.odu.edu/
⟹ GET …/20050331091610/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 44 Days
STICKY TARGET (MEMENTO)
⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
⟹ GET <timegate>/http://sci.odu.edu/ HTTP/1.1
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 22 Days
⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 0 Days
TWO BROWSING POLICIES
SLIDING TARGET
Target
• Resource datetime
Drift types
• Memento drift
• Target drift
STICKY TARGET
Target
• Original datetime
Drift type
• Only memento drift
TWO TYPES OF DRIFT
Target Drift
• Drift introduced by changing the target datetime
• | received-datetime – original-datetime |
Memento Drift
• Drift introduced by not having the exact datetime
requested available.
• | received-datetime – requested-datetime |
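The two drift types can be illustrated by replaying the fable under both policies. The sketch below uses the two formulas above and a deliberately abbreviated, hypothetical set of holdings (only the mementos that appear in the fable); the function and key names are invented for the example.

```python
from datetime import datetime


def nearest(holdings: list, target: datetime) -> datetime:
    """Memento datetime closest to the requested datetime."""
    return min(holdings, key=lambda m: abs(m - target))


def days(a: datetime, b: datetime) -> float:
    return abs((a - b).total_seconds()) / 86400.0


def replay(path: list, archive: dict, original: datetime, sticky: bool) -> list:
    """Walk the path and record both drift types at every step."""
    requested = original
    rows = []
    for uri in path:
        received = nearest(archive[uri], requested)
        rows.append({
            "uri": uri,
            "received": received,
            "memento_drift": days(received, requested),  # |received - requested|
            "target_drift": days(received, original),    # |received - original|
        })
        if not sticky:
            # Sliding policy: the memento just received becomes the next target.
            requested = received
    return rows


# Hypothetical, abbreviated holdings containing only the fable's mementos.
archive = {
    "http://www.cs.odu.edu/": [
        datetime(2005, 3, 31, 9, 16, 10),
        datetime(2005, 5, 14, 1, 36, 8),
    ],
    "http://sci.odu.edu/": [datetime(2005, 4, 22, 0, 17, 52)],
}
path = ["http://www.cs.odu.edu/", "http://sci.odu.edu/", "http://www.cs.odu.edu/"]
original = datetime(2005, 5, 14, 1, 36, 8)

sliding_rows = replay(path, archive, original, sticky=False)
sticky_rows = replay(path, archive, original, sticky=True)

for label, rows in (("sliding", sliding_rows), ("sticky", sticky_rows)):
    for row in rows:
        print(label, row["received"], round(row["memento_drift"], 1),
              round(row["target_drift"], 1))
```

Under the sliding policy the last step shows a small memento drift but a much larger target drift, while under the sticky policy the two measures are identical because the requested datetime never changes.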

Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive

Editor's Notes

  • #2 Please forgive the long title. Let me explain it with a fable…
  • #3 A student at ODU becomes curious about the history of the Computer Science Department and visits the Internet Archive’s Wayback Machine.
  • #4 The student enters http://www.cs.odu.edu and is shown the available dates. The student navigates to 2005 and selects 14 May @ 01:36:08.
  • #5 The student reviews the Computer Science page. Finding the College of Sciences link interesting, the student clicks on it.
  • #6 After reviewing the College of Sciences page, the student returns to the Computer Science page, and…
  • #7 Whoa! That’s not what was expected!
  • #8 What just happened? We expected the left side, but got the right side. This is a result of applying the Sliding Target Policy. Highlight the temporal drift.
  • #9 This is an example of the “Sliding Target Policy.” Here is how it works: We started on the May 14 page we selected. When the College of Sciences link was clicked, May 14 was used as the target.
  • #10 And April 22 was the nearest memento (archived version). When the Computer Science link was clicked, April 22 was used as the target.
  • #11 And March 31 was the nearest memento.
  • #12 “What if the target datetime is held steady instead of being allowed to drift?” The Memento extension to HTTP enables this.
  • #13 This is a very abbreviated introduction to the Memento API. The Memento API allows an HTTP client to negotiate a datetime. On request, the client adds the Accept-Datetime header. On reply, the server sends the Memento-Datetime header, indicating the actual datetime of the memento returned. Memento-Datetime is generally the acquisition datetime of the archived copy.
  • #14 Sticky target can be accomplished using the MementoFox extension to Firefox. MementoFox allows the desired datetime to be entered and to remain fixed. (CLICK) The nearest memento is retrieved. (CLICK) In this case, it is the May 14 Computer Science page, the same one we selected using the Wayback Machine UI. When the College of Sciences link is clicked… (CLICK)
  • #15 The April 22 page is shown again because the target datetime is still 2005-05-14, so it is still the nearest. (CLICK) When Computer Science is clicked again…
  • #16 May 14 is shown as expected. (PAUSE)
  • #17 Here is a quick comparison. Review: Sticky drift is 1/3 of Sliding.
  • #18 This leads to questions: How much temporal drift can be expected? How much improvement can Sticky provide (assuming it is the policy needed)? Does Sticky always produce less drift?
  • #19 The rest of this presentation will take the following form: a brief discussion of related work and how this research improves our knowledge; a description of how we measured drift; a review of the results; and a quick look at how this work can be refined.
  • #20 The majority of work to date has focused on improving the quality of data acquisition. Spaniol et al. focused on crawling strategy. Denev et al. looked at change rate by MIME type. Ben Saad et al. used crawl metadata to improve presentation to the user. Our focus is getting the best results from existing collections. After all, we can’t go back and “fix” past data acquisition.
  • #22 Let’s start with a few definitions. Walk length is the number of successful steps, that is, steps with HTTP 200 responses for both the timemap and the memento. Choice is the sum of the number of unique links at each walk step. Unique domains is the number of domains seen during the walk. These are domains such as jcdl.org or amazon.com. Independent sites within domains were not segregated (e.g., wordpress.com is a single domain). Drift is the magnitude of the difference between the initial target datetime and the Memento-Datetime.
  • #23 Let us return to our fable starting with the selection of the first memento.The first step of the process is selecting a URI.
  • #24 Next, the URI’s timemap is downloaded. Timemaps are a computer-readable form of the calendar page. (CLICK) This is a partial timemap for www.cs.odu.edu. Once we have the timemap, a memento is randomly selected. (CLICK) This is the entry for ODU CS Home on May 14, 2005 01:36:08.
  • #25 Next both mementos (Wayback Machine and Memento API) are downloaded.
  • #26 And common links are determined. This completes the first iteration of the process. Let’s look at the statistics so far.
  • #27 So far we have 1 successful step, 1 unique domain (odu.edu), 48 links (choice), and no drift. (But note that zero drift is not always the case on the first step.)
  • #28 To start the next iteration, a link is randomly selected.
  • #29 Subsequent iterations are similar to the first. The only difference is that since the target datetime could have drifted on the Wayback Machine side, it is possible that two different mementos are selected.
  • #30 From the College of Sciences, we go to the ODU home page. This adds a successful step but does not add a new domain. It also adds 33 additional links. Note the missing image. This is quite common but does not change drift calculations.
  • #31 This is an example of an acquisition-time redirect.
  • #32 In this case, www.odusports.com redirected to odusports.collegesports.com, which is probably a service provider.
  • #33 The ODU Sports page has a link to vtext.com, probably because Verizon was a sponsor.
  • #34 Finally, clicking on “Get It Now” stops the walk with a 404.
  • #35 Walks stop for many reasons. The main reasons are: (CLICK ON EACH) 403: Access not allowed; 404: Not archived; 503: Not currently available; Not HTML (no links); No common links (divergent versions).
  • #38 Occurrences: Exponential scale. Very few walks make it past the mid-20s. Mean Drift: Shows that sticky is 45-60 nearer on average. (CLICK) Counter to intuition that drift decreases over time. And the standard deviation is all over the place.
  • #39 The data is variable enough that median is the best measure of central tendency. The main point of this graph is that the Sticky policy reins in drift and the Sliding policy allows it to continue to increase. Notes: The initial up curve is due to choosing a known Memento-Datetime. We suspect the drop starting at steps 42+ is due to large, self-referencing sites (101celebrities.com) and clusters of related sites.
  • #40 Here is another look at the data. Again blue is the sliding policy and green sticky. Blacks and red are high density, orange and red medium, blues and greens low. An interesting note: even on the first step there is sometimes considerable drift. This happens when the archive redirects from one Memento-Datetime to another. Even though each of these graphs represents over 48K mementos, the sliding policy graph is more spread out because the drift is higher. But let’s focus on the highest density points, those with 64 or more mementos. Here the increased drift is clearly visible in the increased height at nearly every step.
  • #41 Next is drift by choice. Choice, on the horizontal scale, is exponential. Choice is the total choice per walk, so the data clusters at the lower numbers because there are more shorter walks. The key here is that drift does increase with choice, but not by much.
  • #42 Number of domains, on the other hand, has a dramatic effect on drift. Here, the horizontal axis is the number of domains in a walk. The vertical axis is the mean drift across all the walks with the same number of domains. As with walk length, the sticky policy controls drift and the sliding policy allows it to increase.
  • #48 Occurrences: Exponential scale. Very few walks make it past the mid-20s. Mean Drift: Shows that sticky is 45-60 nearer on average. (CLICK) Counter to intuition that drift decreases over time. And the standard deviation is all over the place.