The Great WARC
Adventure
Nick Ruest (York University)
Ian Milligan (University of Waterloo)
Today’s Talk
* A brief historical overview of web archiving
* How to capture and preserve
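* Discoverability and usability (with open source software)
* Interplay of the archivist and the historian
* Piecing the story together
* Internet Archive isn’t the only way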
Historical overview of
web archiving
Going back to 1995/1996
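We do not know today what Mozart sounded like on the keyboard ... What will future generations know of our history? ... But digital technology seemed to come to the rescue, allowing indefinite storage without loss. Now we find that digital information too, has its dark side. (Michael Lesk, 1995)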
The Internet Archive
But that’s not all...
and… [http://freedaleaskey.plggta.org/]
How to capture and
preserve
https://digital.library.yorku.ca/yul-92640/biafra
WARC?
https://digital.library.yorku.ca/yul-92640/biafra
PROVENANCE!
wget --mirror --page-requisites --warc-file=THIS_AWESOME_SITE http://thisawesomesite.ca
Light weight
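
A WARC keeps provenance next to the content: each record carries headers such as WARC-Date and WARC-Target-URI. The sketch below is one rough way to peek at those headers from the command line. It assumes a gzip-compressed WARC (wget's default output, e.g. THIS_AWESOME_SITE.warc.gz for the command above); the function name `warc_provenance` is made up for illustration, and a payload line that happens to start with one of these header names would also be printed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the capture date and target URI headers
# found in a gzip-compressed WARC file.
warc_provenance() {
  local warc="$1"
  # WARC header lines end in CRLF, so strip the carriage returns first.
  # grep -a forces text mode because record payloads can be binary.
  gzip -dc "$warc" \
    | tr -d '\r' \
    | grep -a -E '^WARC-(Date|Target-URI): '
}
```

Usage, assuming the crawl above has finished: `warc_provenance THIS_AWESOME_SITE.warc.gz | head`.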
Heritrix
Industrial strength
Discoverable and usable
using open source software
$$$
Can we do this with Open
Source software?
Drupal
Fedora Commons
Islandora
Heritrix
wkhtmltopdf
wkhtmltoimage
Use the tools you know
Can we talk OAIS?
Heritrix
Wget
wkhtmltopdf
wkhtmltoimage
...and Bash!
SIP - Submission Information Package
Drupal
Islandora
Fedora Commons
Web ARChive SP
Checksum
Checksum checker
PREMIS
FITS
AIP - Archival Information Package
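
The idea behind the checksum and checksum-checker pieces can be sketched in a few lines of Bash: record a digest for every file when the package is created, then recompute and compare later. This is only an illustration of fixity checking, not how the Islandora modules work internally. It assumes GNU coreutils (`sha256sum`); the function names and the manifest filename `manifest-sha256.txt` are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a SHA-256 manifest for every file in a
# package directory (the manifest itself is excluded).
write_manifest() {
  local dir="$1"
  ( cd "$dir" \
      && find . -type f ! -name manifest-sha256.txt \
           -exec sha256sum {} + > manifest-sha256.txt )
}

# Hypothetical helper: recompute the digests and compare them with the
# manifest. Returns non-zero if any file is missing or has changed.
verify_manifest() {
  local dir="$1"
  ( cd "$dir" && sha256sum --quiet -c manifest-sha256.txt )
}
```

Usage: run `write_manifest /path/to/package` once at ingest, then `verify_manifest /path/to/package` on a schedule; a failed check names the file that no longer matches.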
Drupal
Islandora
Fedora Commons
Web ARChive SP
DIP - Dissemination Information Package
Interplay of the archivist
and historian
An Ideal Case Study?
A Historian in the Archive
* WARC-Tools (https://code.google.com/p/warc-tools/)
* By Date (awesome!)
A Historian in the Archive
* Don’t read the comments
* Disqus
A Historian in the Archive
Learning from Word
Frequency
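
A word-frequency count is simple enough to sketch with standard Unix tools. The function below assumes the pages have already been converted to plain-text files; the name `word_freq` is made up, and the ASCII-only splitting is a simplification (accented characters may split words, depending on locale). It is not the tooling used for the charts in this talk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N most frequent words across the given
# plain-text files as "count<TAB>word", most frequent first.
word_freq() {
  local n="$1"; shift
  cat "$@" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alpha:]' '\n' \
    | grep -v '^$' \
    | sort | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{print $1 "\t" $2}' \
    | head -n "$n"
}
```

Usage: `word_freq 25 pages/*.txt` lists the 25 most common words. In practice a stop-word filter would be needed before the output says much.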
A Historian in the Archive (distant
reading)
A Historian in the Archive (distant
reading)
A Historian in the Archive (distant
reading)
Keywords = Gotta Know
What You’re Looking For
A Historian in the Archive (distant
reading) - search ‘edwin mellen’
A Historian in the Archive (distant
reading) - search ‘librarians’
A Historian in the Archive (distant
reading) - search ‘dale askey’
Problem: still need to
know what you’re looking
for!
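
To make the limitation concrete, here is a rough sketch of a keyword search over extracted text: it can only count what you ask it to count. It assumes plain-text files, one per captured page, with no colons in the file names; `keyword_hits` is a made-up name and this is not the search interface shown on the slides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the lines in each file that contain a
# phrase (case-insensitive, fixed string), busiest file first.
keyword_hits() {
  local phrase="$1"; shift
  # grep exits non-zero when nothing matches; that is not an error here.
  { grep -H -i -c -F -- "$phrase" "$@" || true; } \
    | sort -t: -k2,2nr
}
```

Usage: `keyword_hits 'dale askey' pages/*.txt` prints lines such as `pages/post.txt:2`, where `pages/post.txt` is a placeholder file name.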
A Historian in the Archive (distant
reading)
A Historian in the Archive (distant
reading)
Helps to piece the story
together from massive
web archives
Internet Archive isn’t the
only way!
…but they created the
Web Archiving Lifecycle
Thanks!
Nick Ruest: ruestn@yorku.ca / @ruebot
Ian Milligan: i2milligan@uwaterloo.ca / @ianmilligan1
The presentation that Nick Ruest (York University) and I gave at the Association of Canadian Archivists' annual meeting in Victoria, BC on 26 June 2014.

Published in: Internet, Technology, Education

  1. 1. The Great WARC Adventure Nick Ruest (York University) Ian Milligan (University of Waterloo)
  2. 2. Today’s Talk * A brief historical overview of web archiving * How to capture and preserve * Discoverability and usability (with open source software) * Interplay of the archivist and the historian * Piecing the story together * Internet Archive isn’t the only way
  3. 3. Historical overview of web archiving
  4. 4. Going back to 1995/1996 We do not know today what Mozart sounded like on the keyboard ... What will future generations know of our history? ... But digital technology seemed to come to the rescue, allowing indefinite storage without loss. Now we find that digital information too, has its dark side. (Michael Lesk, 1995)
  5. 5. The Internet Archive
  6. 6. But that’s not all...
  7. 7. and… [http://freedaleaskey.plggta.org/]
  8. 8. How to capture and preserve
  9. 9. https://digital.library.yorku.ca/yul-92640/biafra WARC? https://digital.library.yorku.ca/yul-92640/biafra
  10. 10. PROVENANCE!
  11. 11. wget --mirror --page-requisites --warc-file=THIS_AWESOME_SITE http://thisawesomesite.ca Light weight
  12. 12. Heritrix Industrial strength
  13. 13. Discoverable and usable using open source software
  14. 14. $$$
  15. 15. Can we do this with Open Source software?
  16. 16. Drupal Fedora Commons Islandora Heritrix wkhtmltopdf wkhtmltoimage Use the tools you know
  17. 17. Can we talk OAIS?
  18. 18. Heritrix Wget wkhtmltopdf wkhtmltoimage ...and Bash! SIP - Submission Information Package
  19. 19. Drupal Islandora Fedora Commons Web ARChive SP Checksum Checksum checker PREMIS FITS AIP - Archival Information Package
  20. 20. Drupal Islandora Fedora Commons Web ARChive SP DIP - Dissemination Information Package
  21. 21. Interplay of the archivist and historian
  22. 22. An Ideal Case Study?
  23. 23. A Historian in the Archive * WARC-Tools (https://code.google.com/p/warc-tools/) * By Date (awesome!)
  24. 24. A Historian in the Archive * Don’t read the comments * Disqus
  25. 25. A Historian in the Archive
  26. 26. Learning from Word Frequency
  27. 27. A Historian in the Archive (distant reading)
  28. 28. A Historian in the Archive (distant reading)
  29. 29. A Historian in the Archive (distant reading)
  30. 30. Keywords = Gotta Know What You’re Looking For
  31. 31. A Historian in the Archive (distant reading) - search ‘edwin mellen’
  32. 32. A Historian in the Archive (distant reading) - search ‘librarians’
  33. 33. A Historian in the Archive (distant reading) - search ‘dale askey’
  34. 34. Problem: still need to know what you’re looking for!
  35. 35. A Historian in the Archive (distant reading)
  36. 36. A Historian in the Archive (distant reading)
  37. 37. Helps to piece the story together from massive web archives
  38. 38. Internet Archive isn’t the only way!
  39. 39. …but they created the Web Archiving Lifecycle
  40. 40. Thanks! Nick Ruest: ruestn@yorku.ca / @ruebot Ian Milligan: i2milligan@uwaterloo.ca / @ianmilligan1