Your SlideShare is downloading. ×
Tool Academy: Web Archiving
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Tool Academy: Web Archiving

5,285
views

Published on

Presentation given at the Digital Cultural Humanities DC Meetup on the gamut of web archiving tools.

Presentation given at the Digital Cultural Humanities DC Meetup on the gamut of web archiving tools.

Published in: Technology

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,285
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
15
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • http://www.httrack.com/
  • https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
  • http://archiveteam.org/index.php?title=Wget_with_WARC_output
  • http://matkelly.com/warcreate/
  • http://warrick.cs.odu.edu/
  • https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
  • https://github.com/internetarchive/wayback
  • https://addons.mozilla.org/en-us/firefox/addon/mementofox/
  • http://webcurator.sourceforge.net/
  • https://sbforge.org/display/NAS/NetarchiveSuite
  • http://cinch.nclive.org/Cinch/
  • http://www.archive-it.org/
  • http://webarchives.cdlib.org/was
  • http://code.google.com/p/httrack2arc/
  • http://code.hanzoarchives.com/warc-tools
  • https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+%28WAT%29+Specification,+Utilities,+and+Usage+Overview
  • Transcript

    • 1. Tool Academy: Web Archiving Nicholas Taylor @nullhandle Digital Cultural Heritage DC Meetup December 20, 2012 “cobwebbed screw driver” by Flickr user Colby Gutierrez-Kraybill under CC BY 2.0
    • 2. what does a web archive look like? W/ARC (web archive container format) flat files, directory tree “Buckets and Buckets” by Flickr user Josh Kenzer under CC BY-NC-SA 2.0 “Files” by Flickr user Artform Canada under CC BY-NC-ND 2.0
    • 3. CAPTURE TOOLS “Crab traps?” by Flickr user Aviruthia under CC BY-NC-SA 2.0
    • 4. HTTrack • small-scale website copier • recreates website structure as filesystem hierarchy • Windows GUI or CLI • *nix local web service or CLI http://www.httrack.com/
    • 5. Heritrix • web-scale archival crawler • WARC output • configure and run through web service • Java app, runs best on *nix https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
    • 6. Wget • retrieve Internet- accessible files • supports WARC output • CLI utility http://archiveteam.org/index.php?title=Wget_with_WARC_output
    • 7. WARCreate • archive single webpage(?) to WARC • Chrome extension • no production release yet • may eventually bundle a self- contained Wayback Machine http://matkelly.com/warcreate/
    • 8. Warrick • reconstruct website from web archives • uses Memento protocol • web service or downloadable Perl script http://warrick.cs.odu.edu/
    • 9. ArchiveFacebook • archive an individual (authenticated) Facebook profile • Firefox add-on https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
    • 10. REPLAY TOOLS “Replay” by Flickr user shareski under CC BY-NC 2.0
    • 11. Wayback Machine • replay web resources stored in WARC and ARC files • web service provided by Internet Archive • also, downloadable software package • Java app (Tomcat), runs best on *nix https://github.com/internetarchive/wayback
    • 12. MementoFox • federated discovery of web archive resources • uses Memento protocol • utility limited by paucity of aggregated indexes https://addons.mozilla.org/en-us/firefox/addon/mementofox/
    • 13. WORKFLOW TOOLS “Replay” by Flickr user shareski under CC BY-NC 2.0
    • 14. Web Curator Tool • permissioning, job scheduling, harvesting, quality review, storing descriptive metadata • coupled with Heritrix v1.0 • Java app (Tomcat) http://webcurator.sourceforge.net/
    • 15. NetarchiveSuite • job scheduling, data transfer to preservation system, proxy replay • built for national domain crawls • coupled with Heritrix v1.0 • Java app (JMS) https://sbforge.org/display/NAS/NetarchiveSuite
    • 16. CINCH • batch retrieval of Internet-accessible documents and transfer to preservation system • web service for NC state government • also, downloadable software package • runs on *nix http://cinch.nclive.org/Cinch/
    • 17. HOSTED SERVICES “Services” by Flickr user spodzone under CC BY-NC-ND 2.0
    • 18. Archive-It • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by Internet Archive http://www.archive-it.org/
    • 19. Web Archiving Service • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by California Digital Library http://webarchives.cdlib.org/was
    • 20. FILE UTILITIES “Begin at the Beginning” by Flickr user kate e. did under CC BY-NC-SA 2.0
    • 21. HTTrack2Arc • convert HTTrack output to ARC format • CLI Java utility http://code.google.com/p/httrack2arc/
    • 22. warc-tools • parse and re-write WARC files • convert ARC files to WARC files • no production release yet • CLI Python utilities http://code.hanzoarchives.com/warc-tools
    • 23. Web Archive Transformation (WAT) Utilities • extract metadata from WARCs for data analysis • read data from local, http, or hdfs- accessible W/ARCs • output JSON https://webarchive.jira.com/wiki/display/Ire search/Web+Archive+Transformation+ %28WAT%29+Specification,+Utilities, +and+Usage+Overview
    • 24. thank you! Nicholas Taylor @nullhandle