Tool Academy: Web Archiving

Presentation on the gamut of web archiving tools, given at the Digital Cultural Heritage DC Meetup.

Published in: Technology
Links from the slide notes:
  • http://www.httrack.com/
  • https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
  • http://archiveteam.org/index.php?title=Wget_with_WARC_output
  • http://matkelly.com/warcreate/
  • http://warrick.cs.odu.edu/
  • https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
  • https://github.com/internetarchive/wayback
  • https://addons.mozilla.org/en-us/firefox/addon/mementofox/
  • http://webcurator.sourceforge.net/
  • https://sbforge.org/display/NAS/NetarchiveSuite
  • http://cinch.nclive.org/Cinch/
  • http://www.archive-it.org/
  • http://webarchives.cdlib.org/was
  • http://code.google.com/p/httrack2arc/
  • http://code.hanzoarchives.com/warc-tools
  • https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+%28WAT%29+Specification,+Utilities,+and+Usage+Overview

    1. Tool Academy: Web Archiving • Nicholas Taylor, @nullhandle • Digital Cultural Heritage DC Meetup, December 20, 2012 • “cobwebbed screw driver” by Flickr user Colby Gutierrez-Kraybill under CC BY 2.0
    2. what does a web archive look like? • W/ARC (web archive container format) • flat files, directory tree • “Buckets and Buckets” by Flickr user Josh Kenzer under CC BY-NC-SA 2.0 • “Files” by Flickr user Artform Canada under CC BY-NC-ND 2.0
    3. CAPTURE TOOLS • “Crab traps?” by Flickr user Aviruthia under CC BY-NC-SA 2.0
    4. HTTrack • small-scale website copier • recreates website structure as filesystem hierarchy • Windows GUI or CLI • *nix local web service or CLI • http://www.httrack.com/
    5. Heritrix • web-scale archival crawler • WARC output • configure and run through web service • Java app, runs best on *nix • https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
    6. Wget • retrieve Internet-accessible files • supports WARC output • CLI utility • http://archiveteam.org/index.php?title=Wget_with_WARC_output
    7. WARCreate • archive single webpage(?) to WARC • Chrome extension • no production release yet • may eventually bundle a self-contained Wayback Machine • http://matkelly.com/warcreate/
    8. Warrick • reconstruct website from web archives • uses Memento protocol • web service or downloadable Perl script • http://warrick.cs.odu.edu/
    9. ArchiveFacebook • archive an individual (authenticated) Facebook profile • Firefox add-on • https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
    10. REPLAY TOOLS • “Replay” by Flickr user shareski under CC BY-NC 2.0
    11. Wayback Machine • replay web resources stored in WARC and ARC files • web service provided by Internet Archive • also, downloadable software package • Java app (Tomcat), runs best on *nix • https://github.com/internetarchive/wayback
    12. MementoFox • federated discovery of web archive resources • uses Memento protocol • utility limited by paucity of aggregated indexes • https://addons.mozilla.org/en-us/firefox/addon/mementofox/
    13. WORKFLOW TOOLS • “Replay” by Flickr user shareski under CC BY-NC 2.0
    14. Web Curator Tool • permissioning, job scheduling, harvesting, quality review, storing descriptive metadata • coupled with Heritrix v1.0 • Java app (Tomcat) • http://webcurator.sourceforge.net/
    15. NetarchiveSuite • job scheduling, data transfer to preservation system, proxy replay • built for national domain crawls • coupled with Heritrix v1.0 • Java app (JMS) • https://sbforge.org/display/NAS/NetarchiveSuite
    16. CINCH • batch retrieval of Internet-accessible documents and transfer to preservation system • web service for NC state government • also, downloadable software package • runs on *nix • http://cinch.nclive.org/Cinch/
    17. HOSTED SERVICES • “Services” by Flickr user spodzone under CC BY-NC-ND 2.0
    18. Archive-It • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by Internet Archive • http://www.archive-it.org/
    19. Web Archiving Service • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by California Digital Library • http://webarchives.cdlib.org/was
    20. FILE UTILITIES • “Begin at the Beginning” by Flickr user kate e. did under CC BY-NC-SA 2.0
    21. HTTrack2Arc • convert HTTrack output to ARC format • CLI Java utility • http://code.google.com/p/httrack2arc/
    22. warc-tools • parse and re-write WARC files • convert ARC files to WARC files • no production release yet • CLI Python utilities • http://code.hanzoarchives.com/warc-tools
    23. Web Archive Transformation (WAT) Utilities • extract metadata from WARCs for data analysis • read data from local, http, or hdfs-accessible W/ARCs • output JSON • https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+%28WAT%29+Specification,+Utilities,+and+Usage+Overview
    24. thank you! • Nicholas Taylor, @nullhandle
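
The deck describes W/ARC as a web archive container format that several of the listed tools write or read. As a rough illustration of what such a container holds, the sketch below serializes a minimal, uncompressed WARC-style record in memory and parses it back using only the Python standard library. The helper names (`build_record`, `read_record`), the example URIs, and the payloads are hypothetical and not taken from the slides. The sketch ignores gzip compression and most of the header fields a real archive would carry, so a maintained WARC library is likely the better choice for actual archives.

```python
import io
import uuid
from datetime import datetime, timezone


def build_record(target_uri: str, payload: bytes, record_type: str = "resource") -> bytes:
    """Serialize one uncompressed WARC/1.0-style record (illustrative only)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, named header fields, then a blank line before the content block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record is terminated by two CRLF pairs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_record(stream):
    """Read one record from a binary stream; return (headers, payload) or None at EOF."""
    version = stream.readline()
    if not version:
        return None
    if not version.startswith(b"WARC/"):
        raise ValueError("not at the start of a WARC record")
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:  # blank line marks the end of the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # The content block is read by length, so it may itself contain CRLFs.
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # consume the two CRLF pairs that close the record
    return headers, payload


if __name__ == "__main__":
    # Two hypothetical records concatenated, the way a W/ARC file bundles many captures.
    archive = build_record("http://example.com/", b"hello") + build_record(
        "http://example.com/about", b"about page"
    )
    stream = io.BytesIO(archive)
    while (record := read_record(stream)) is not None:
        headers, payload = record
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Because records are simply concatenated, a single file can hold many captures, in contrast to the "flat files, directory tree" layout, where each resource becomes its own file on disk.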
