Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive

EVALUATING SLIDING AND
STICKY TARGET POLICIES
BY MEASURING TEMPORAL
DRIFT IN ACYCLIC WALKS
THROUGH A WEB ARCHIVE
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
JCDL 2013
JULY 23-25, 2013
INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013
A FABLE FROM WAYBACK
7/23/13 Scott G. Ainsworth • Michael L. Nelson
2
A long, long time ago…
ODU Computer Science
updated its web site…
What did it look like?
May 2005...


WHAT JUST HAPPENED?
WHAT WE EXPECTED
2005-05-14 @ 01:36:08
WHAT WE GOT
2005-03-31 @ 09:16:10
7

SLIDING TARGET
8
2005-05-14
01:36:08

SLIDING TARGET
9
2005-04-22
00:17:52

SLIDING TARGET
10
2005-03-31
09:16:10

STICKY TARGET
What if the target
is held steady?
(Enabled by Memento API)
11

MEMENTO HTTP EXTENSION*
Adds ability to request a particular date-time
Enables Sticky Target
12
Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…
Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…
*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
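
The header exchange above can be sketched with the Python standard library alone. This is a minimal illustration, not a Memento client: nothing is fetched, the response header is copied from the slide, and the helper names `accept_datetime_header` and `memento_datetime` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> str:
    """Format an aware datetime as an HTTP-date for Accept-Datetime."""
    return format_datetime(target.astimezone(timezone.utc), usegmt=True)


def memento_datetime(headers: dict) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


# Desired datetime for the request (10 May 2005, 11:21:00 GMT).
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = {"Accept-Datetime": accept_datetime_header(target)}

# Response header copied from the slide, not fetched from a live archive.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
received = memento_datetime(response_headers)

# Gap between what was asked for and what the archive returned, in days.
gap_days = abs(received - target).total_seconds() / 86400
```

Keeping both values as timezone-aware datetimes makes the gap a plain subtraction, which is the quantity the rest of the deck calls drift.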

STICKY TARGET
13
2005-05-14
01:36:08
MementoFox Extension

STICKY TARGET
14
2005-04-22
00:17:52

STICKY TARGET
15
2005-05-14
01:36:08

DRIFT COMPARISON
                Sliding                                        Sticky
Page            Datetime              Drift                    Datetime              Drift
CS Home         2005-05-14 01:36:08   –                        2005-05-14 01:36:08   –
Science Home    2005-04-22 00:17:52   22.1 days                2005-04-22 00:17:52   22.1 days
CS Home         2005-03-31 09:16:10   43.7 days (+21.6 days)   2005-05-14 01:36:08   –
Mean                                  32.9 days                                      11.0 days
16
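
The means in the comparison can be reproduced with a few lines of arithmetic. The sketch below assumes, as the table suggests, that drift is measured against the original target (2005-05-14 01:36:08) and that the mean is taken over the two steps after the first; the function names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
target = datetime.strptime("2005-05-14 01:36:08", FMT)

# Memento datetimes received at steps 2 and 3, copied from the table.
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]


def drift_days(received: str) -> float:
    """Absolute difference from the original target, in days."""
    delta = abs(datetime.strptime(received, FMT) - target)
    return delta.total_seconds() / 86400


def mean_drift(received_list: list) -> float:
    """Mean drift in days over the listed steps."""
    drifts = [drift_days(r) for r in received_list]
    return sum(drifts) / len(drifts)


print(round(mean_drift(sliding), 1))  # 32.9
print(round(mean_drift(sticky), 1))   # 11.0
```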

QUESTIONS
How much temporal drift is there with the two
policies?
Does the sticky policy reduce drift as expected?
If so, by how much?
How do
• Choice (number of links)
• Domains visited
• Walk length
influence drift?
17

CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
• Future work
18

RELATED WORK
Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and
depth
• Ben Saad et al. – metadata from crawl used to
select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
19

CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
• Future work & conclusions
20

DEFINITIONS
Walk Length       Number of successful steps (HTTP 200 response)
Unique Domains    Number of unique domains (jcdl.org, amazon.com, etc.)
Choice            Number of unique links (calculated per page)
Drift             | target-datetime_1 – Memento-Datetime_i |
21
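
As a rough sketch of how the first three definitions might be computed for one walk, the code below takes a list of step records and tallies them. The record layout, the function names, and the two-label domain rule are assumptions made for illustration. The URIs and link counts come from the walk example in this deck, with the unnamed second step left out and a hypothetical 404 step added, so the choice total differs from the one on the slides.

```python
from urllib.parse import urlsplit


def registered_domain(uri: str) -> str:
    """Naive domain: last two host labels (assumption; ignores suffixes like co.uk)."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def walk_metrics(steps):
    """steps: list of (uri, http_status, unique_link_count) tuples in walk order."""
    ok = [s for s in steps if s[1] == 200]  # only HTTP 200 steps count
    return {
        "walk_length": len(ok),
        "unique_domains": len({registered_domain(uri) for uri, _, _ in ok}),
        "choice": sum(links for _, _, links in ok),  # accumulated over the walk
    }


steps = [
    ("http://www.cs.odu.edu/", 200, 48),
    ("http://www.odu.edu", 200, 33),
    ("http://odusports.collegesports.com", 200, 77),
    ("http://www.vtext.com", 200, 14),
    ("http://example.org/missing", 404, 0),  # hypothetical step that ends the walk
]

metrics = walk_metrics(steps)
```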

PROCESS BY EXAMPLE
Select a URI
• Random selection of 1 out of 4,000
4,000 sample URIs – same as JCDL 2011 paper
• DMOZ – a reference
• Search Engines – best random sampling
• Bitly – does shortening have an impact?
• Delicious – does popularity have an impact?
“How Much of the Web Is Archived?”
http://arxiv.org/abs/1212.6177
22

PROCESS BY EXAMPLE
First, select a URI
• Random selection of 1 out of 4,000
Second, download timemap
23
<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>;
rel="memento";
datetime="Sat, 07 May 2005 09:37:40 GMT",
rel="memento";
rel="memento";
datetime="Sun, 15 May 2005 00:29:03 GMT",
rel="memento";
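
A timemap excerpt like the one above can be parsed and searched for the memento nearest a target datetime. The sketch below is an assumption-laden illustration rather than the authors' tooling: it handles only the simple `<uri>; rel="memento"; datetime="..."` shape, the first URI and both datetimes come from the slide, and the second URI is a hypothetical placeholder because the slide does not show it.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Two entries in link format. The second URI is a hypothetical placeholder.
TIMEMAP = '''
<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>;
 rel="memento"; datetime="Sat, 07 May 2005 09:37:40 GMT",
<http://archive.example/memento-2>;
 rel="memento"; datetime="Sun, 15 May 2005 00:29:03 GMT"
'''

ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')   # URI plus its attribute text
DATETIME = re.compile(r'datetime="([^"]+)"')  # datetime attribute value


def parse_timemap(text):
    """Return (uri, datetime) pairs for entries that carry a datetime attribute."""
    mementos = []
    for uri, attrs in ENTRY.findall(text):
        found = DATETIME.search(attrs)
        if found:
            mementos.append((uri, parsedate_to_datetime(found.group(1))))
    return mementos


def closest_memento(mementos, target):
    """Pick the memento whose datetime is nearest to the target."""
    return min(mementos, key=lambda m: abs(m[1] - target))


target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
uri, when = closest_memento(parse_timemap(TIMEMAP), target)
```

With these two entries the 15 May capture is selected, since it lies less than a day from the target while the 7 May capture lies almost a week away.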

PROCESS BY EXAMPLE
Next, download both mementos
Wayback Machine Memento API
24

PROCESS BY EXAMPLE
Next, download both mementos
and find common links
25

STATUS SO FAR
Successful Steps 1
Unique Domains 1
Choice 48
Mean Drift (days) 0.0 WB 0.0 API
26

PROCESS BY EXAMPLE
Find common links
and select one for the next step
27

PROCESS BY EXAMPLE
The timemap is downloaded, the best datetimes are
selected, and the mementos are downloaded…
28
Successful Steps 1 + 1 = 2
Unique Domains 1 + 0 = 1
Choice 48 + 36 = 84

PROCESS BY EXAMPLE
Again for http://www.odu.edu
29
Choice 84 + 33 = 117

PROCESS BY EXAMPLE
And for http://www.odusports.com
Redirected at acquisition time
30
HTTP Response:
• 302 Redirect
• Location header

PROCESS BY EXAMPLE
And for http://odusports.collegesports.com
31
Choice 117 + 77 = 194

PROCESS BY EXAMPLE
And for http://www.vtext.com
32
Choice 194 + 14 = 208

PROCESS BY EXAMPLE
And 404 stops the walk
33
HTTP Response:
• 404 Not Found
Choice 194 + 14 = 208

STOP CAUSES
                        First Step           Subsequent Steps
Stop Cause              Count    Percent     Count     Percent
Timemaps
  HTTP 403                 74       1.7%     4,803        9.1%
  HTTP 404              1,327      30.1%    15,850       29.9%
  HTTP 503                  0       0.0%        43        0.1%
  Other                     2       0.0%       180        0.3%
Mementos
  HTTP 403                 52       1.2%       476        0.9%
  HTTP 404                215       4.9%     3,633        6.8%
  HTTP 503              1,957      44.4%    10,535       19.9%
  Download failed         154       3.5%       589        1.1%
  Not HTML                514      11.7%     2,856        5.4%
  No Common Links           0       0.0%    12,957       24.4%
  Other                   117       2.7%     1,128        2.1%
Totals                  4,412               53,050
34

CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
35

WALKS AND STEPS
Status Total
Walks Attempted 200,000
Unique Walks 53,100
Successful Walks 48,685
Pct. Successful 91.7%
Steps 240,439
Successful Steps 187,371
w/drift > 1yr 6,701
w/drift > 5yrs 111
Successful Steps/Walk 3.8
36
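
The two derived rows in the table follow directly from the counts. A short check, with variable names chosen only for this illustration:

```python
unique_walks = 53_100
successful_walks = 48_685
successful_steps = 187_371

# Share of unique walks that had at least one successful step.
pct_successful = 100 * successful_walks / unique_walks

# Average number of successful steps per successful walk.
steps_per_walk = successful_steps / successful_walks

print(f"{pct_successful:.1f}%")  # 91.7%
print(f"{steps_per_walk:.1f}")   # 3.8
```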

WALK LENGTHS
37
[Figure: occurrences (log scale, 1 to 10,000) by walk length (1 to 50)]

MEDIAN DRIFT BY STEP
38
[Figure: Median Drift by Step. X axis: step number (1 to 50). Y axis: median drift in months (0 to 3). Legend: API, UI; Sliding, Sticky.]

DRIFT BY STEP
39
[Figures: SLIDING POLICY, Drift by Step (UI); STICKY POLICY, Drift by Step (API). X axis: step number. Y axis: drift in years (1 to 10). Series: at least 1, 8, 64, and 4,096 mementos.]

DRIFT BY CHOICE
40
[Figure: mean drift (months) by choice. Series: Sliding, Sticky.]

DRIFT BY DOMAINS
41
[Figure: mean drift (months) by domain count. Series: Sliding, Sticky.]

CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
42

FUTURE WORK
Integrate real-world walk patterns
• AlNoamany et al. – Internet Archive logs
• Domains users avoid – link farms, etc.
• Domain clusters
• Self referencing domains – 101celebrities.com
Check other archives
• Other archives now have Memento API
43

CONCLUSIONS
30 days less drift using Sticky policy.
Sticky policy controls drift;
Sliding policy does not.
44

BACKUP
45

WALK LENGTHS
Walk Length DMOZ S.Eng. Delicious Bitly Total
1 5,355 1,239 7,139 1,289 15,076
2 3,571 924 4,857 817 10,169
3 1,891 598 3,311 623 6,423
4 1,212 381 2,228 415 4,236
5 791 315 1,588 314 3,008
6 583 232 1,168 259 2,242
7 417 178 877 186 1,658
8 258 153 651 136 1,198
9 187 111 498 108 904
10 144 79 337 79 679
…
20 14 10 36 9 76
…
41-45 6 2 14 2 24
46-50 6 3 6 1 16
46

MEAN DRIFT BY STEP
47
[Figure: Mean Drift by Step. X axis: step number (1 to 50). Y axis: mean drift in months (0 to 7). Legend: API, UI; Sliding, Sticky; ● μ, ○ σ.]

SLIDING TARGET
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
⟹ GET …/20050514013608/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 FOUND
Location: …/20050422001752/http://sci.odu.edu/
Location: …/20050331091610/http://www.cs.odu.edu/
48

SLIDING TARGET
49
22 Days
44 Days

STICKY TARGET
⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
⟹ GET <timegate>/http://sci.odu.edu/ HTTP/1.1
50

STICKY TARGET (MEMENTO)
⟹ GET <timegate>/http://sci.odu.edu/ HTTP/1.1
51
22 Days
0 Days

TWO BROWSING POLICIES
SLIDING TARGET
Target
• Resource datetime
Drift types
• Memento drift
• Target drift
STICKY TARGET
Target
• Original datetime
Drift type
• Only memento drift
52
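
The difference between the two policies comes down to which datetime is used as the target at each step. The sketch below simulates the three-step fable against a toy in-memory archive. It assumes the archive holds only the three captures named in the fable and always returns the capture nearest the target; `ARCHIVE`, `closest` and `walk` are hypothetical names, not part of any real API.

```python
from datetime import datetime


def dt(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Hypothetical holdings: only the three captures named in the fable.
ARCHIVE = {
    "http://www.cs.odu.edu/": [dt("2005-03-31 09:16:10"), dt("2005-05-14 01:36:08")],
    "http://sci.odu.edu/": [dt("2005-04-22 00:17:52")],
}


def closest(uri, target):
    """Memento datetime for uri nearest to target."""
    return min(ARCHIVE[uri], key=lambda m: abs(m - target))


def walk(uris, original, policy):
    """Return the memento datetime received at each step of the walk."""
    target, received = original, []
    for uri in uris:
        got = closest(uri, target)
        received.append(got)
        if policy == "sliding":
            target = got  # target slides to the memento just received
        # sticky: target stays at the original datetime
    return received


path = ["http://www.cs.odu.edu/", "http://sci.odu.edu/", "http://www.cs.odu.edu/"]
start = dt("2005-05-14 01:36:08")
sliding = walk(path, start, "sliding")
sticky = walk(path, start, "sticky")
```

Under these assumptions the sliding walk ends on the 2005-03-31 capture of the CS home page, while the sticky walk returns to the 2005-05-14 capture it started from.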

TWO TYPES OF DRIFT
Target Drift
• Drift introduced by changing the target datetime
• | received-datetime – original-datetime |
Memento Drift
• Drift introduced when the exact datetime requested
is not available.
• | received-datetime – requested-datetime |
53
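
Both formulas can be applied to the last step of the sliding walk in the fable. In this sketch the requested datetime is taken to be the datetime of the previous memento (2005-04-22 00:17:52), which is how the sliding policy is described in this deck; the function names are illustrative.

```python
from datetime import datetime


def days_between(a: datetime, b: datetime) -> float:
    """Absolute difference between two datetimes, in days."""
    return abs(a - b).total_seconds() / 86400


def target_drift(received: datetime, original: datetime) -> float:
    """| received-datetime - original-datetime |, in days."""
    return days_between(received, original)


def memento_drift(received: datetime, requested: datetime) -> float:
    """| received-datetime - requested-datetime |, in days."""
    return days_between(received, requested)


# Third step of the sliding walk: CS home page reached from the Science page.
original = datetime(2005, 5, 14, 1, 36, 8)    # where the walk started
requested = datetime(2005, 4, 22, 0, 17, 52)  # datetime of the previous memento
received = datetime(2005, 3, 31, 9, 16, 10)   # memento actually returned

print(round(target_drift(received, original), 1))    # 43.7
print(round(memento_drift(received, requested), 1))  # 21.6
```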
