How I spend my summer vacations
Justin F. Brunelle
WS-DL Research Group
Department of Computer Science
Old Dominion University
WADL 2013
Justin in a nutshell
• PhD Student at ODU
• Dynamic representations
–in the archives
–Improved quality from archived data
–Alter-ego: Application Developer
• at The MITRE Corporation
–Big data & cloud computing
How much can we archive?
The setup
• 1,000 URIs from Twitter
• 1,000 URIs from Archive-It
• Capture with tools
• Study the archivability
Good
Good
Good
Meh…
Zombies in the Archives
Bad
Bad
Bad
Bad
Bad
Why?
Losing the Moment
• What we share != What we curate
• 4.2% of Twitter is perfectly archived
–Losing My Revolution: 11% gone in 2 years
• 34.2% of Archive-It is perfectly archived
• Accessibility? Gov vs. non-Gov?
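A minimal sketch of how a "perfectly archived" percentage could be computed. It assumes, as an illustration only, that a memento counts as perfectly archived when none of its embedded resources are missing; the slides do not state the exact definition, and the function name and input format are hypothetical.

```python
def percent_perfectly_archived(missing_counts):
    """Return the percentage of mementos with zero missing embedded resources.

    missing_counts: one integer per memento, the number of embedded
    resources that could not be loaded from the archive (assumed input).
    """
    if not missing_counts:
        return 0.0
    perfect = sum(1 for missing in missing_counts if missing == 0)
    return 100.0 * perfect / len(missing_counts)


# Toy data, not the study's results: four mementos, two with nothing missing.
sample = [0, 2, 0, 5]
print(percent_perfectly_archived(sample))
```

Under this assumption a single missing image or stylesheet is enough to make a memento imperfect, which is why a count like this says nothing about how bad the loss is.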
Measuring memento damage
VS.
Not all embedded resources are created equal
Not all embedded resources are created equal
Planned Work
• Evaluate importance of missing stuff
–Size, position
–# CSS Classes
–Not all stylesheets created equal
– Missing border vs missing functionality
– “Whitespace”
–Provide Web service
• Mechanical Turk evaluation of “damage”
• Evaluate collections of mementos
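One way the planned weighting of missing resources by size and position might look is sketched below. Everything here is an assumption for illustration: the EmbeddedResource fields, the 50/50 weights, and the centrality measure are hypothetical and are not the method from the talk.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedResource:
    uri: str
    area_px: int       # rendered area in pixels (assumed to be measured elsewhere)
    centrality: float  # 0.0 = page edge, 1.0 = centre of the viewport (assumed)
    missing: bool      # True if the archive could not serve this resource


def importance(resource, page_area_px):
    """Hypothetical importance: equal parts size share and centrality."""
    size_share = min(resource.area_px / page_area_px, 1.0) if page_area_px else 0.0
    return 0.5 * size_share + 0.5 * resource.centrality


def damage(resources, page_area_px):
    """Fraction of total importance carried by missing resources (0.0 to 1.0)."""
    total = sum(importance(r, page_area_px) for r in resources)
    if total == 0:
        return 0.0
    lost = sum(importance(r, page_area_px) for r in resources if r.missing)
    return lost / total


# Toy page: a small logo at the edge and a large central image.
PAGE_AREA = 1_000_000
logo_missing = [
    EmbeddedResource("logo.png", 10_000, 0.1, True),
    EmbeddedResource("hero.jpg", 400_000, 0.9, False),
]
hero_missing = [
    EmbeddedResource("logo.png", 10_000, 0.1, False),
    EmbeddedResource("hero.jpg", 400_000, 0.9, True),
]
print(damage(logo_missing, PAGE_AREA) < damage(hero_missing, PAGE_AREA))
```

Both toy pages are missing exactly one resource, yet the score differs, which is the point of "not all embedded resources are created equal". Stylesheet importance (number of CSS classes, lost functionality) would need its own term and is left out of this sketch.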
What does it all mean?
• Archivability is measurable
• Damage is measurable
• If we can predict archivability…
–We can try new methods of archiving on “hard to capture” mementos
–Attempt repairs on existing mementos
–Gauge our successes in real-time
• Next step: capturing dynamic content