Tool Academy: Web Archiving
Nicholas Taylor
@nullhandle
Digital Cultural Heritage DC Meetup
December 20, 2012 “cobwebbed s...
what does a web archive look
like?
W/ARC (web archive
container format) flat files, directory tree
“Buckets and Buckets” b...
CAPTURE TOOLS
“Crab traps?” by Flickr user Aviruthia under CC BY-NC-SA 2.0
HTTrack
• small-scale website
copier
• recreates website
structure as
filesystem hierarchy
• Windows GUI or CLI
• *nix loc...
Heritrix
• web-scale archival
crawler
• WARC output
• configure and run
through web service
• Java app, runs best
on *nix
...
Wget
• retrieve Internet-
accessible files
• supports WARC
output
• CLI utility
http://archiveteam.org/index.php?title=Wge...
WARCreate
• archive single
webpage(?) to
WARC
• Chrome extension
• no production
release yet
• may eventually
bundle a sel...
Warrick
• reconstruct website
from web archives
• uses Memento
protocol
• web service or
downloadable Perl
script
http://w...
ArchiveFacebook
• archive an
individual
(authenticated)
Facebook profile
• Firefox add-on
https://addons.mozilla.org/en-US...
REPLAY TOOLS
“Replay” by Flickr user shareski under CC BY-NC 2.0
Wayback Machine
• replay web
resources stored in
WARC and ARC files
• web service
provided by
Internet Archive
• also, dow...
MementoFox
• federated discovery
of web archive
resources
• uses Memento
protocol
• utility limited by
paucity of
aggregat...
WORKFLOW TOOLS
“Replay” by Flickr user shareski under CC BY-NC 2.0
Web Curator Tool
• permissioning, job
scheduling,
harvesting, quality
review, storing
descriptive
metadata
• coupled with
...
NetarchiveSuite
• job scheduling,
data transfer to
preservation
system, proxy
replay
• built for national
domain crawls
• ...
CINCH
• batch retrieval of
Internet-accessible
documents and
transfer to
preservation system
• web service for NC
state go...
HOSTED SERVICES
“Services” by Flickr user spodzone under CC BY-NC-ND 2.0
Archive-It
• integrated web
archiving platform
• uses Heritrix and
Wayback Machine
• contract service
provided by
Internet...
Web Archiving Service
• integrated web
archiving platform
• uses Heritrix and
Wayback Machine
• contract service
provided ...
FILE UTILITIES
“Begin at the Beginning” by Flickr user kate e. did under CC BY-NC-SA 2.0
HTTrack2Arc
• convert HTTrack
output to ARC
format
• CLI Java utility
http://code.google.com/p/httrack2arc/
warc-tools
• parse and re-write
WARC files
• convert ARC files to
WARC files
• no production
release yet
• CLI Python util...
Web Archive Transformation
(WAT) Utilities
• extract metadata
from WARCs for
data analysis
• read data from
local, http, o...
thank you!
Nicholas Taylor
@nullhandle
Upcoming SlideShare
Loading in...5
×

Tool Academy: Web Archiving

5,408

Published on

Presentation given at the Digital Cultural Humanities DC Meetup on the gamut of web archiving tools.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,408
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
15
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • http://www.httrack.com/
  • https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
  • http://archiveteam.org/index.php?title=Wget_with_WARC_output
  • http://matkelly.com/warcreate/
  • http://warrick.cs.odu.edu/
  • https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
  • https://github.com/internetarchive/wayback
  • https://addons.mozilla.org/en-us/firefox/addon/mementofox/
  • http://webcurator.sourceforge.net/
  • https://sbforge.org/display/NAS/NetarchiveSuite
  • http://cinch.nclive.org/Cinch/
  • http://www.archive-it.org/
  • http://webarchives.cdlib.org/was
  • http://code.google.com/p/httrack2arc/
  • http://code.hanzoarchives.com/warc-tools
  • https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+%28WAT%29+Specification,+Utilities,+and+Usage+Overview
  • Tool Academy: Web Archiving

    1. 1. Tool Academy: Web Archiving Nicholas Taylor @nullhandle Digital Cultural Heritage DC Meetup December 20, 2012 “cobwebbed screw driver” by Flickr user Colby Gutierrez-Kraybill under CC BY 2.0
    2. 2. what does a web archive look like? W/ARC (web archive container format) flat files, directory tree “Buckets and Buckets” by Flickr user Josh Kenzer under CC BY-NC-SA 2.0 “Files” by Flickr user Artform Canada under CC BY-NC-ND 2.0
    3. 3. CAPTURE TOOLS “Crab traps?” by Flickr user Aviruthia under CC BY-NC-SA 2.0
    4. 4. HTTrack • small-scale website copier • recreates website structure as filesystem hierarchy • Windows GUI or CLI • *nix local web service or CLI http://www.httrack.com/
    5. 5. Heritrix • web-scale archival crawler • WARC output • configure and run through web service • Java app, runs best on *nix https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
    6. 6. Wget • retrieve Internet- accessible files • supports WARC output • CLI utility http://archiveteam.org/index.php?title=Wget_with_WARC_output
    7. 7. WARCreate • archive single webpage(?) to WARC • Chrome extension • no production release yet • may eventually bundle a self- contained Wayback Machine http://matkelly.com/warcreate/
    8. 8. Warrick • reconstruct website from web archives • uses Memento protocol • web service or downloadable Perl script http://warrick.cs.odu.edu/
    9. 9. ArchiveFacebook • archive an individual (authenticated) Facebook profile • Firefox add-on https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
    10. 10. REPLAY TOOLS “Replay” by Flickr user shareski under CC BY-NC 2.0
    11. 11. Wayback Machine • replay web resources stored in WARC and ARC files • web service provided by Internet Archive • also, downloadable software package • Java app (Tomcat), runs best on *nix https://github.com/internetarchive/wayback
    12. 12. MementoFox • federated discovery of web archive resources • uses Memento protocol • utility limited by paucity of aggregated indexes https://addons.mozilla.org/en-us/firefox/addon/mementofox/
    13. 13. WORKFLOW TOOLS “Replay” by Flickr user shareski under CC BY-NC 2.0
    14. 14. Web Curator Tool • permissioning, job scheduling, harvesting, quality review, storing descriptive metadata • coupled with Heritrix v1.0 • Java app (Tomcat) http://webcurator.sourceforge.net/
    15. 15. NetarchiveSuite • job scheduling, data transfer to preservation system, proxy replay • built for national domain crawls • coupled with Heritrix v1.0 • Java app (JMS) https://sbforge.org/display/NAS/NetarchiveSuite
    16. 16. CINCH • batch retrieval of Internet-accessible documents and transfer to preservation system • web service for NC state government • also, downloadable software package • runs on *nix http://cinch.nclive.org/Cinch/
    17. 17. HOSTED SERVICES “Services” by Flickr user spodzone under CC BY-NC-ND 2.0
    18. 18. Archive-It • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by Internet Archive http://www.archive-it.org/
    19. 19. Web Archiving Service • integrated web archiving platform • uses Heritrix and Wayback Machine • contract service provided by California Digital Library http://webarchives.cdlib.org/was
    20. 20. FILE UTILITIES “Begin at the Beginning” by Flickr user kate e. did under CC BY-NC-SA 2.0
    21. 21. HTTrack2Arc • convert HTTrack output to ARC format • CLI Java utility http://code.google.com/p/httrack2arc/
    22. 22. warc-tools • parse and re-write WARC files • convert ARC files to WARC files • no production release yet • CLI Python utilities http://code.hanzoarchives.com/warc-tools
    23. 23. Web Archive Transformation (WAT) Utilities • extract metadata from WARCs for data analysis • read data from local, http, or hdfs- accessible W/ARCs • output JSON https://webarchive.jira.com/wiki/display/Ire search/Web+Archive+Transformation+ %28WAT%29+Specification,+Utilities, +and+Usage+Overview
    24. 24. thank you! Nicholas Taylor @nullhandle
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×