BROWSING AND RECOMPOSITION POLICIES TO MINIMIZE TEMPORAL ERROR WHEN UTILIZING WEB ARCHIVES
SCOTT G. AINSWORTH
OLD DOMINION UNIVERSITY COMPUTER SCIENCE
Joint Conference on Digital Libraries (JCDL) 2013
My temporal coherence in public web archives presentation from the JCDL 2013 Doctoral Consortium.

  • Please forgive the long title. Let me explain it with a fable…
  • The rest of this presentation will take the following form: a brief discussion of related work and how this research improves our knowledge; a description of how we measured drift; a review of the results; and a quick look at how this work can be refined.
  • A student at ODU becomes curious about the history of the Computer Science Department and visits the Internet Archive’s Wayback Machine.
  • The student enters http://www.cs.odu.edu and is shown the available dates. The student navigates to 2005 and selects 14 May @ 01:36:08.
  • The student reviews the Computer Science page. Finding the College of Sciences link interesting, the student clicks on it.
  • After reviewing the College of Sciences page, the student returns to the Computer Science page, and…
  • Whoa! That’s not what was expected!
  • What just happened? We expected the left side, but got the right side. This is a result of applying the Sliding Target Policy. Highlight the temporal drift.
  • Let us return to temporal spread. Even though the display is May 14, 2005, (CLICK) the resources were captured at very different times: (CLICK) some days, (CLICK) some months, (CLICK) even years (in this case an image in the footer).
  • This leads to questions:
  • The majority of work to date has focused on improving the quality of data acquisition. Spaniol et al. focused on crawling strategy. Denev et al. looked at change rates by MIME type. Ben Saad et al. used crawl metadata to improve presentation to the user. Our focus is getting the best results from existing collections. After all, we can’t go back and “fix” past data acquisition.
  • Memento is an HTTP extension for datetime negotiation. It is now implemented by the Internet Archive, Archive.is, the UK National Archives, and the UK Web Archive. This is a very abbreviated introduction to the Memento API. The Memento API allows an HTTP client to negotiate a datetime. On request, the client adds the Accept-Datetime header. On reply, the server sends the Memento-Datetime header, indicating the actual datetime of the memento returned. Memento-Datetime is generally the acquisition datetime of the archived copy.
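The header exchange described in this note can be sketched in Python without contacting an archive. This is a minimal illustration rather than the Memento specification: the helper names `accept_datetime_header` and `negotiated_distance` are hypothetical, and the response header is hard-coded from the example on the Memento slide instead of being fetched from a real TimeGate.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Build the request header a client would send (hypothetical helper)."""
    # usegmt=True renders the "..., GMT" form of HTTP dates shown on the slide.
    return {"Accept-Datetime": format_datetime(target, usegmt=True)}


def negotiated_distance(target: datetime, response_headers: dict):
    """Return how far the returned memento is from the requested datetime."""
    actual = parsedate_to_datetime(response_headers["Memento-Datetime"])
    return actual - target


# Datetime the client asks for (taken from the slide's request example).
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)

# Assumed response, copied from the slide example instead of a live archive.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
distance = negotiated_distance(target, response_headers)

print(request_headers["Accept-Datetime"])
print("memento is", distance, "after the requested datetime")
```

The gap between the requested and returned datetimes is the quantity that the drift and spread measurements later in the talk build on.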
  • At JCDL 2011, we published “How Much of the Web Is Archived?” This density chart gives a sense of Web archival patterns. Each row represents a single URI, so row 200 represents the 200th URI. The rows are ordered such that the URI with the earliest memento is on the bottom. The empty rows at the top are URIs that are not archived. Each dot represents a single memento. Most mementos, the brown dots, come from the Internet Archive. The blue dots are search engine caches; note that since this study was completed, the search engine caches have all locked down, so they are effectively no longer viable sources. The red dots represent other archives (WebCite, etc.).
  • We have investigated the temporal drift that occurs while browsing archives. (CLICK) Let us pick up from the introduction.
  • This is an example of the “Sliding Target Policy.” Here is how it works: We started on the May 14 page we selected. When the College of Sciences link was clicked, May 14 was used as the target.
  • And April 22 was the nearest memento (archived version). When the Computer Science link was clicked, April 22 was used as the target.
  • And March 31 was the nearest memento.
  • “What if the target datetime is held steady instead of being allowed to drift?” The Memento extension to HTTP enables this.
  • Sticky target can be accomplished using the MementoFox extension to Firefox. MementoFox allows the desired datetime to be entered and remain fixed. (CLICK) The nearest memento is retrieved. (CLICK) In this case, it is the May 14 Computer Science page, the same one we selected using the Wayback Machine UI. When the College of Sciences link is clicked… (CLICK)
  • The April 22 page is shown again because the target datetime is still 2005-05-14, so it is still the nearest. (CLICK) When Computer Science is clicked again…
  • May 14 is shown, as expected. (PAUSE)
  • The data is variable enough that the median is the best measure of central tendency. The main point of this graph is that the sticky policy reins in drift, while the sliding policy allows it to continue to increase. Notes: The initial upward curve is due to choosing a known Memento-Datetime. We suspect the drop starting at steps 42+ is due to large, self-referencing sites (101celebrities.com) and clusters of related sites.
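The difference between the sliding and sticky policies can be replayed with the three captures from the fable. The sketch below is an illustration under stated assumptions, not the experiment's code: `TIMEMAPS`, `nearest`, and `walk` are hypothetical names, and each page is assumed to have only the mementos mentioned in the walkthrough.

```python
from datetime import datetime

# Assumed timemaps: only the mementos named in the fable, keyed by page.
TIMEMAPS = {
    "cs": [datetime(2005, 3, 31, 9, 16, 10), datetime(2005, 5, 14, 1, 36, 8)],
    "sci": [datetime(2005, 4, 22, 0, 17, 52)],
}


def nearest(mementos, target):
    """Pick the memento whose datetime is closest to the target."""
    return min(mementos, key=lambda m: abs(m - target))


def walk(pages, start_target, sticky):
    """Visit pages in order and return the memento shown at each step."""
    target = start_target
    shown = []
    for page in pages:
        memento = nearest(TIMEMAPS[page], target)
        shown.append(memento)
        if not sticky:
            # Sliding policy: the memento just shown becomes the next target.
            target = memento
    return shown


start = datetime(2005, 5, 14, 1, 36, 8)
visits = ["cs", "sci", "cs"]  # Computer Science, College of Sciences, back again
sliding = walk(visits, start, sticky=False)
sticky = walk(visits, start, sticky=True)

for label, result in (("sliding", sliding), ("sticky", sticky)):
    print(label, result[-1].isoformat(), "drift:", abs(result[-1] - start))
```

Under these assumptions the sliding walk ends on the 2005-03-31 capture, while the sticky walk returns to the 2005-05-14 capture, which matches the walkthrough.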
  • Let us return to temporal spread. Most web pages are composed from multiple resources, some of which are circled here. (WAIT FOR ANIMATION)
  • We call the collection of all mementos required to display a web page a composite memento. A composite memento consists of a root and embedded mementos and can be represented as a tree. (It is actually a graph, but can be represented as a tree without loss of generality.) (CLICK) The root is represented as URI-M0 at the top of the tree on the right. Embedded mementos, such as images, are also represented in the tree. Embedded mementos can themselves have embedded mementos, for example HTML in a frame. (The ODU CS home page had frames in its 1990s versions, but no longer does.)
  • This is a list of all the mementos that comprise http://www.cs.odu.edu. It is a bit of an eye chart, so here is a summary: (CLICK) There are 26 embedded mementos (27 total including the root). The mean delta (distance from root) is 125.9 days. The standard deviation is 207.7 days, which does not bode well for the mean. Here’s the kicker: the spread is 2.1 years!
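The delta and spread figures can be recomputed from the Memento-Datetimes on the embedded-resources slide. The sketch below uses only four of the rows, so it does not reproduce the reported mean or standard deviation. It also assumes that spread means the gap between the earliest and latest Memento-Datetime in the composite memento, root included; the slides do not spell that definition out.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
root = datetime.strptime("2005-05-14 01:36:08", FMT)

# A handful of rows copied from the embedded-resources slide (not the full list).
embedded = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "logo-cs.gif": "2005-12-28 19:54:29",
    "shadow-b.gif": "2007-06-21 02:35:17",
}


def delta_days(memento_datetime: str) -> float:
    """Delta = |root - embedded|, expressed in days."""
    captured = datetime.strptime(memento_datetime, FMT)
    return abs(captured - root).total_seconds() / 86400


deltas = {name: delta_days(dt) for name, dt in embedded.items()}

# Assumption: spread = latest minus earliest Memento-Datetime, root included.
all_datetimes = [root] + [datetime.strptime(dt, FMT) for dt in embedded.values()]
spread_years = (max(all_datetimes) - min(all_datetimes)).days / 365.25

for name, days in deltas.items():
    print(f"{name}: {days:.1f} d")
print(f"spread: {spread_years:.1f} years")
```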
  • Assume we have a composite resource with two embedded images. The graph on the right represents two composite mementos for this resource. The red diamonds are the root mementos, captured at different datetimes. Roots are centered at 0 delta; embedded mementos are offset by their delta. The blue and orange diamonds represent the embedded mementos. Orange mementos are from the same domain as the root. Blue mementos are from a different domain. Gray diamonds represent reused mementos.
  • Now let’s have a look at the full chart (as of mid-2012) for cs.odu.edu. (CLICK) Here is the 2005-05-14 page we have been looking at. (CLICK) Here is a page from 2011, (CLICK) and one from 2011. Several things stand out: The maximum spread is nearly 7 years (2005 row). Many embedded resources were acquired well after the corresponding root memento. Reuse appears very high.
  • Consider 2 mementos, 1 root and 1 embedded. (EXPLAIN why there is only one) In this case the embedded memento was captured after the root. (POINT OUT WHICH IS WHICH) Is this coherent? It is hard to tell.
  • But add the Last-Modified date and it becomes clearer. In this case, the embedded memento’s Last-Modified and Memento-Datetime “bracket” the root, providing evidence that the embedded memento existed when the root was captured.
  • So we consider it coherent.
  • But what happens when the root is not bracketed? In this case, there is evidence that the embedded memento did NOT exist when the root was captured.
  • We consider this a temporal coherence violation.
  • Similarly, if Last-Modified is missing, the embedded memento cannot be shown to be temporally coherent. But should it be a violation? It could actually be coherent. We are still gathering data on this one.
  • Similarly, if the embedded memento was captured before the root, was it still in existence when the root was captured? Probably, but more study is required.
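The single-memento cases above can be collected into one decision function. This is a sketch of the rules as the notes state them, not the authors' implementation: the function name `classify` and the label "undetermined" are hypothetical, the Last-Modified values in the usage lines are invented for illustration, and the case where Last-Modified equals the root datetime is treated as coherent by assumption.

```python
from datetime import datetime
from typing import Optional


def classify(root: datetime, embedded: datetime,
             last_modified: Optional[datetime]) -> str:
    """Classify one embedded memento against its root (single-memento cases)."""
    if embedded < root:
        # Captured before the root: probably still existed, but unproven.
        return "possibly coherent"
    if last_modified is None:
        # No Last-Modified: cannot be shown coherent; the notes leave this open.
        return "undetermined"
    if last_modified <= root:
        # Last-Modified <= root <= embedded: the root is bracketed.
        return "coherent"
    # root < Last-Modified <= embedded: the resource changed after the root.
    return "violation"


root = datetime(2005, 5, 14, 1, 36, 8)
later_capture = datetime(2005, 5, 23, 2, 39, 12)

# Hypothetical Last-Modified values chosen to exercise each branch.
print(classify(root, later_capture, datetime(2005, 5, 1)))    # bracketed root
print(classify(root, later_capture, datetime(2005, 5, 20)))   # root not bracketed
print(classify(root, later_capture, None))                    # no Last-Modified
print(classify(root, datetime(2005, 5, 1), None))             # captured before root
```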
  • Recall the single-memento, root-not-bracketed pattern.
  • What happens if a second memento for the embedded resource is available? We can’t prove either existed when the root was captured. It opens another possibility…
  • Comparing the mementos. Here we introduce similarity measures. For images, direct comparison is appropriate because archives leave these alone. For text, HTML in particular, archives annotate the content (add comments) with metadata. In this case we must use a similarity measure such as shingling or SimHash.
  • What happens if a second memento for the embedded resource is available? It opens another possibility: comparing the mementos.
  • If the contents are equal, there is evidence that the embedded memento existed when the root was captured.
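For text mementos, one way to approximate "contents are equal" is to drop HTML comments and compare word shingles with Jaccard similarity. The sketch below is one possible approach under assumptions: it treats every HTML comment as archive-added metadata, the sample pages and the comment text are invented, and the shingle size of three words is arbitrary. It does not implement SimHash.

```python
import re


def strip_comments(html: str) -> str:
    """Remove HTML comments, assumed here to hold the archive's annotations."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the shingle sets of two documents."""
    sa = shingles(strip_comments(a), k)
    sb = shingles(strip_comments(b), k)
    return len(sa & sb) / len(sa | sb)


# Invented sample mementos of the same embedded HTML resource.
first = "<html><body><h1>Computer Science</h1><p>Welcome to the department</p></body></html>"
second = first + "\n<!-- hypothetical archive annotation -->"
changed = first.replace("department", "college")

print(jaccard(first, second))   # differs only by the added comment
print(jaccard(first, changed))  # body text differs
```

A policy could then treat two bracketing mementos as equivalent when the similarity exceeds a chosen threshold; what threshold is reasonable is exactly the open question raised in the notes.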
  • Real-world access patterns will bring results more in line with actual user experience. Do we see real humans go 50 steps? If not, why: Is there no need? Is the interface a problem? Does it get too weird? Try to avoid sites humans would avoid (very subjective: I avoid 101celebrities.com, but you might like it). We suspect both drift and spread are influenced not just by single large domains, but also by clusters of related domains, Amazon.com and amazon-images.com for instance. Sussing out related domains will help clarify results.
  • Timemaps only tell part of the story. Memento-Datetimes in timemaps frequently redirect to a different datetime or URI. This is reflected in the drift research but not the spread research. This redirection will change the deltas. Besides, what does it mean when we are redirected to another datetime? (We suspect the archive has recognized a duplicate.) Another common occurrence is missing mementos: they are in the timemap but not available in the archive. Our research to date simply lists these as missing. But as policies and heuristics are developed, user priorities might require several responses (leave it missing, substitute the next nearest, etc.).
  • Delta is the absolute value of the difference between the root and embedded Memento-Datetimes. But there are other conditions that could or should indicate a delta of 0 instead. These all revolve around determining that no change has occurred. One of these is bracketing mementos. Explain the chart… However, HTML is problematic because comments are added by the archives, so we can’t check for equality. What similarity measure or measures are reasonable substitutes for equality?
  • Succinctly communicating the status of a composite memento or walk to the user is important. (CLICK) This just isn’t very user friendly. (CLICK) We need a single mutable icon or symbol that can be easily explained and understood. We may need several, one for casual users and one for researchers. (CLICK) For multiple-archive composites, we also need to acknowledge their contribution.
  • Finally, policies and heuristics must be developed. For example, in the drift work we used sliding and sticky targets.
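The Policies & Heuristics slide names four spread heuristics (minimize distance, past only, past preferred, near or within distance) without defining them. The sketch below shows one plausible reading of each, purely as an assumption: the function `select`, the policy strings, the 90-day default window, and the candidate datetimes are all hypothetical and are not taken from the experiments.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def select(mementos: List[datetime], target: datetime, policy: str,
           window: timedelta = timedelta(days=90)) -> Optional[datetime]:
    """Choose an embedded memento for a target datetime under an assumed policy."""
    nearest = min(mementos, key=lambda m: abs(m - target), default=None)
    latest_past = max((m for m in mementos if m <= target), default=None)
    if policy == "minimize_distance":
        return nearest
    if policy == "past_only":
        return latest_past
    if policy == "past_preferred":
        return latest_past if latest_past is not None else nearest
    if policy == "within_distance":
        if nearest is not None and abs(nearest - target) <= window:
            return nearest
        return None  # leave the resource missing rather than stretch the spread
    raise ValueError(f"unknown policy: {policy}")


root = datetime(2005, 5, 14, 1, 36, 8)
# Hypothetical captures of one embedded resource.
candidates = [datetime(2004, 11, 2), datetime(2005, 5, 23, 2, 39, 12),
              datetime(2007, 6, 21, 2, 35, 17)]

for policy in ("minimize_distance", "past_only", "past_preferred", "within_distance"):
    print(policy, select(candidates, root, policy))
```

Returning `None` corresponds to the "leave it missing" response mentioned in the notes; a stricter window trades completeness for a smaller spread.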

    1. BROWSING AND RECOMPOSITION POLICIES TO MINIMIZE TEMPORAL ERROR WHEN UTILIZING WEB ARCHIVES. SCOTT G. AINSWORTH, OLD DOMINION UNIVERSITY COMPUTER SCIENCE. JCDL 2013, JULY 23-25, 2013, INDIANAPOLIS, INDIANA USA
    2. Joint Conference on Digital Libraries (JCDL) 2013. CONTENTS: Motivation • Related work • Preliminary work • Future work • Conclusion. 7/23/13 Scott G. Ainsworth • Michael L. Nelson
    3. A FABLE FROM WAYBACK: A long, long time ago… ODU Computer Science updated its web site… What did it look like? May 2005...
    4. A FABLE FROM WAYBACK
    5. A FABLE FROM WAYBACK
    6. A FABLE FROM WAYBACK
    7. A FABLE FROM WAYBACK
    8. WHAT JUST HAPPENED? What we expected: 2005-05-14 @ 01:36:08. What we got: 2005-03-31 @ 09:16:10.
    9. TEMPORAL SPREAD: 2005-05-14 01:36:08; +9 days; +18 days; +18 days; +7 months; +2.1 years
    10. QUESTIONS
       • How much temporal drift do users experience?
       • How much temporal spread exists in composite mementos?
       • How can drift and spread be minimized?
       • What factors contribute, positively or negatively, to drift and spread?
       • Does combining multiple archives produce better results?
       • Would users with differing goals benefit from different minimization policies and heuristics?
       • How can temporal coherence be displayed to users simply?
    11. CONTENTS: Motivation • Related work • Preliminary work • Future work • Conclusion
    12. RELATED WORK
       Web Crawling for Search Engines
       • Douglis – Change rates
       • Cho – Optimal crawling strategies, change rates, Web evolution
       Web Archiving
       • Masanés – Web Archiving: Issues and Methods
       • Jaffe & Kirkpatrick – Internet Archive architecture
       • Mohr et al. – Heritrix crawler
    13. RELATED WORK
       Control Crawl Data Quality, Future collections
       • Spaniol et al. – crawling strategy
       • Denev et al. – change rates by MIME type and depth
       • Ben Saad et al. – metadata from crawl used to select best results from archive
       Our Focus: Existing Data Quality
       • Existing collections
       • Datetime selection policies
    14. RELATED WORK
       Use Patterns
       • AlNoamany et al. – Archive Access Patterns
       • Humans vs. Robots
       • Dip, dive, slide, & skim
       Identifying Duplicates
       • Simple identity – images, other binary formats
       • Direct comparison
       • Hash comparison
       • HTML, CSS (text)
       • Shingling, Jaccard distances, etc.
       • SimHash ← most promise
    15. RELATED WORK – MEMENTO*
       • HTTP extension for datetime negotiation
       Request:
         GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
         …
         Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
         …
       Response:
         HTTP/1.1 200 OK
         …
         Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
         …
       *https://datatracker.ietf.org/doc/draft-vandesompel-memento/
    16. CONTENTS: Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion
    17. HOW MUCH IS ARCHIVED? (JCDL’11)
       • 35 – 90%: at least one archived copy
       • 17 – 49%: 2 – 5 copies
       • 1 – 8%: 6 – 10 copies
       • 8 – 63%: > 10 copies
       [Chart legend: Internet Archive, Search Engine, Other]
    18. CONTENTS: Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion
    19. TEMPORAL DRIFT
       Comparing two policies
       • Sliding – target datetime changes
       • Sticky – target datetime held steady
    20. SLIDING TARGET: 2005-05-14 01:36:08
    21. SLIDING TARGET: 2005-04-22 00:17:52
    22. SLIDING TARGET: 2005-03-31 09:16:10
    23. STICKY TARGET: What if the target is held steady? (Enabled by Memento API)
    24. STICKY TARGET: MementoFox extension, 2005-05-14; 2005-05-14 01:36:08
    25. STICKY TARGET: 2005-04-22 00:17:52
    26. STICKY TARGET: 2005-05-14 01:36:08
    27. MEDIAN DRIFT BY STEP (JCDL’13) [Chart: median drift in months (0 to 3m) against step number (1 to 50); legend: API, UI, Sliding, Sticky]
    28. CONTENTS: Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion
    29. TEMPORAL SPREAD
    30. COMPOSITE MEMENTO: PRESENTATION, STRUCTURE [Tree diagram: URI-M0 at the top; URI-M1, URI-M2, ..., URI-Mi-1, URI-Mi, URI-Mi+1, ..., URI-Mn below]
    31. TEMPORAL SPREAD: 2005-05-14 01:36:08; +9 days; +18 days; +18 days; +7 months; +2.1 years
    32. EMBEDDED RESOURCES
       Resource | Memento-Datetime | Delta
       http://www.cs.odu.edu | 2005-05-14 01:36:08 |
       mm_menu.js | 2005-05-23 02:39:12 | 9.0 d
       style.css | 2005-05-23 02:39:39 | 9.0 d
       gfx-logo-odu-crown.gif | 2005-05-23 02:39:39 | 9.0 d
       ddmenu_ddown.js | 2005-05-23 02:39:43 | 9.0 d
       university.js | 2005-05-23 02:39:56 | 9.0 d
       rmenu_1st_about.png | 2005-06-01 13:40:25 | 18.5 d
       rmenu_bottom_229.gif | 2005-06-01 14:07:29 | 18.5 d
       shadow-bl.gif | 2005-06-01 14:55:53 | 18.6 d
       ecsbdg.jpg | 2005-06-01 14:56:17 | 18.6 d
       shadow-br.gif | 2005-06-01 15:18:18 | 18.6 d
       gfx-btn-go-dblue.gif | 2005-06-01 15:34:19 | 18.6 d
       shadow-tr.gif | 2005-06-01 15:55:57 | 18.6 d
       header-right1.gif | 2005-06-01 16:06:16 | 18.6 d
       spacer.gif | 2005-06-01 16:23:10 | 18.6 d
       jimcheng.gif | 2005-06-01 16:37:39 | 18.6 d
       jsmith.gif | 2005-06-01 16:58:50 | 18.6 d
       rmenu_1st_featured_alumni.png | 2005-06-01 21:21:45 | 18.8 d
       hmenu_college_...-new.png | 2005-12-21 20:14:25 | 7.3 mo
       rmenu_1st_upcoming_news.png | 2005-12-21 20:15:14 | 7.3 mo
       rmenu_1st_upcoming_events.png | 2005-12-21 21:01:12 | 7.3 mo
       lmenu_1st_resources.png | 2005-12-28 17:47:41 | 7.5 mo
       bullet_blue_triangle.gif | 2005-12-28 19:43:48 | 7.5 mo
       logo-cs.gif | 2005-12-28 19:54:29 | 7.5 mo
       rmenu_1st_featured_student.png | 2007-06-12 02:36:07 | 2.1 years
       shadow-b.gif | 2007-06-21 02:35:17 | 2.1 years
       shadow-r.gif | 404 Not Found |
       Summary: Embedded Resources 26; Mean Delta 125.9 days; Standard Deviation 207.7 days; Spread 2.1 years
    33. REPRESENTING SPREAD: COMPOSITE MEMENTO (URI-M0, URI-M1, URI-M2, URI-M3); TEMPORAL SPREAD CHART [Legend: Root, Embedded, Same Domain, Reused]
    34. TEMPORAL SPREAD – ODU CS
    35. TEMPORAL COHERENCE: 1 Memento, Bracketed Root [Diagram labels: root, emb1]
    36. TEMPORAL COHERENCE: 1 Memento, Bracketed Root [Diagram labels: Last-Modified, root, emb1]
    37. TEMPORAL COHERENCE: 1 Memento, Bracketed Root. Last-Modified ≤ root ≤ emb1 ⇒ coherent [Diagram labels: Last-Modified, root, emb1]
    38. TEMPORAL COHERENCE: 1 Memento, Root Not Bracketed [Diagram labels: Last-Modified, root, emb1]
    39. TEMPORAL COHERENCE: 1 Memento, Root Not Bracketed. root ≤ Last-Modified ≤ emb1 ⇒ violation [Diagram labels: Last-Modified, root, emb1]
    40. TEMPORAL COHERENCE: 1 Memento, No Last-Modified
    41. TEMPORAL COHERENCE: 1 Memento, Before Root. Last-Modified ≤ emb < root ⇒ possibly coherent [Diagram labels: n/a, root, emb]
    42. TEMPORAL COHERENCE: 2 Mementos, Root Not Bracketed [Diagram labels: Last-Modified, root, emb1]
    43. TEMPORAL COHERENCE: 2 Mementos, Root Not Bracketed [Diagram labels: Last-Modified, n/a, root, embi+1, embi]
    44. TEMPORAL COHERENCE: 2 Mementos, Use Content – Similarity
    45. TEMPORAL COHERENCE: 2 Mementos, Contents Equal or Equivalent
    46. TEMPORAL COHERENCE: 2 Mementos, Contents Not Equal or Equivalent
    47. FIRST EXPERIMENT
       • 1,000 URIs from DMOZ (Open Directory)
       • Download all timemaps
       • Download all composite mementos
       • Download all embedded resources
       • Single and Multiple Archives
       • Four Heuristics
    48. PRELIMINARY RESULTS 1
       Count | Description | Percent
       1,000 | Root URI-Rs |
       910 | Root timemaps | 91%
       87,847 | Root URI-Ms in timemaps |
       96.5 | URI-Ms per Root URI-R |
       85,570 | Root mementos downloaded | 97%
       1,488,420 | Embedded URI-Rs |
       17.4 | Embedded URI-Rs per Root memento |
    49. PRELIMINARY RESULTS 2
       Description | Minimize Distance, Single Archive | Minimize Distance, Multi-Archive | 3-Month Window, Multi-Archive
       Embedded URI-Rs | 1,488,440 | 1,488,420 | 1,447,351
       Embedded URI-Ms in timemaps | 1,169,787 | 1,186,456 | 500,541
       URI-M/Embedded URI-R | 0.79 | 0.80 | 0.35
       % Complete | 73.8% | 75.4% | 33.8%
       Mean spread | 200.2 | 200.1 | 15.1
       Standard Deviation | 219.2 | 219.9 | 14.3
    50. CURRENT EXPERIMENT
       • 4,000 URIs from JCDL’11 “How Much…” paper
       • 1 URI/month vice all
       • Target WSDM 2013
    51. CONTENTS: Motivation • Related work • Preliminary work • Future work • Conclusion
    52. FUTURE WORK: Browsing Patterns, Clusters & Drift
       • AlNoamany et al. – Real-world access patterns
       • Domains users avoid – link farms, etc.
       • Domain clusters
    53. FUTURE WORK: Timemaps, Redirection, Missing Mementos
       • Timemaps only tell part of the story
       • URI-R redirection (302 from source)
       • URI-M redirection (Archive action)
       • Mementos in timemaps but not accessible
       • Policies must consider user needs
       • Leave it missing
       • Show “best” substitute
    54. FUTURE WORK: Similarity & Duplication
       • Deltas are currently |root – embedded|
       • If bracketing mementos are identical, should delta be zero?
       • HTML is usually modified by the archive
       • Can’t check for equality
       • Shingling? SimHash?
       [Chart axis: –30d, 0, +30d]
    55. FUTURE WORK: Communicating Status [Status labels: Coherent, Partial, Incoherent, Missing]
    56. FUTURE WORK: Policies & Heuristics
       • Drift
       • Sliding target
       • Sticky target
       • Spread
       • Minimize distance
       • Past only
       • Past preferred
       • Near or within distance
       • Single vs. multi-archive
       • Refine to meet user expectations
    57. CONTENTS: Motivation • Related work • Preliminary work • Future work • Conclusion
    58. CONCLUSION
       Extensive research on improving acquisition exists
       Best use of existing collections needs study
       We are looking at
       • Characterizing existing holdings
       • Policies that minimize impact of drift and spread
       • Characterizing memento and walk status
    59. TIMELINE [Chart labels, in slide order: Spread; Policy; Drift; CIKM May '13; paper Nov '13; paper Feb '14; Missing mementos, duplicates, similarity, icon; Ph.D. dissertation; Candidacy Aug '13; Defense Nov '14; paper Spring '14; “Human” patterns, clusters, spam sites]
