web archiving tools and technologies

for the Web Archiving workshop at IS&T Archiving 2013 in Washington, DC

web archiving tools and technologies Presentation Transcript

  • 1. Web Archiving Tools and Technology Dan Chudnov - GWU Libraries dchud at gwu edu @dchud IS&T Workshop, April 2, 2013 Washington DC USA Tuesday, April 2, 13
  • 2. workflow stages: select, scope, crawl, process, access • tools, with the stages each covers marked X (column alignment not preserved): unt nom tool X X • heritrix X X • wct X X X X • netarchive suite X X X X X • warc tools X • nutchwax X X • wayback X
  • 3. select • what to collect • who authorizes • when • what order
  • 4. scope • how much • robots.txt • what to leave out • which doors not to open
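As a rough illustration of the robots.txt and "what to leave out" bullets on slide 4, the sketch below uses Python's standard `urllib.robotparser` to decide whether a URL is in scope. The robots.txt content, the `examplebot` user agent, and the `build_robots` and `in_scope` helpers are assumptions made up for this example; they are not taken from the talk or from any crawler's configuration. A real crawl would keep one parser per host, whereas this sketch uses a single one for brevity.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def in_scope(url, allowed_hosts, robots, user_agent="examplebot"):
    """Return True if the URL is on an allowed host and robots.txt permits it."""
    host = urlparse(url).netloc
    if host not in allowed_hosts:
        return False  # "which doors not to open": stay on the selected hosts
    return robots.can_fetch(user_agent, url)
```

Calling `in_scope("http://example.org/private/x", {"example.org"}, build_robots(ROBOTS_TXT))` would reject the URL because of the `Disallow` rule.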
  • 5. crawl • start with seeds • find, queue, follow links • be kind to each site • parallelize across sites • schedule, log, checkpoint, resume • bundle
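The first few crawl steps on slide 5 (seeds, a queue of discovered links, and politeness toward each site) can be sketched as a small breadth-first loop. This is a minimal illustration, not how heritrix works: the `crawl` function, its `get_links` callback, and the simulated clock are all assumptions introduced here so the example runs without touching the network. It also simply waits when the next URL's host is still "cooling down", rather than parallelizing across sites.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, get_links, min_delay=1.0, max_pages=100):
    """Breadth-first crawl over a link graph, using a simulated clock.

    get_links(url) is a caller-supplied function returning the links on a page.
    Returns a list of (time, url) pairs in visit order.
    """
    queue = deque(seeds)   # frontier of URLs waiting to be visited
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    next_allowed = {}      # host -> earliest simulated time of the next request
    clock = 0.0
    visits = []
    while queue and len(visits) < max_pages:
        url = queue.popleft()
        host = urlparse(url).netloc
        # Politeness: wait (in simulated time) until this host may be hit again.
        clock = max(clock, next_allowed.get(host, 0.0))
        visits.append((clock, url))
        next_allowed[host] = clock + min_delay
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visits
```

A real crawler would replace `get_links` with fetching and link extraction, and would add the scheduling, logging, and checkpointing the slide lists.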
  • 6. process • lump, split, bundle, rebundle • quality control • index, surrogate, reorder, prep for access • store, distribute, preserve
  • 7. access • browse • search • known items • patterns • needles
  • 8. workflow stages: select, scope, crawl, process, access • tools, with the stages each covers marked X (column alignment not preserved): unt nom tool X X • heritrix X X • wct X X X X • netarchive suite X X X X X • warc tools X • nutchwax X X • wayback X
  • 9. UNT URL Nomination Tool • collaborative selection • collect seed lists • attach metadata • agree on scope • feed crawlers
  • 10. heritrix • free software from Internet Archive • easy to start with • difficult to master • powerful, configurable, confusing
  • 11. heritrix cont’d • two major versions, “1” and “3” • WCT and NetArchive embed “1” • “1” - minimal UI • “3” - even less • iterate early - long learning curve • best available tool
  • 12. heritrix cont’d
  • 13. Web Curator Tool • free software from NLNZ / BL • full crawling workflow suite • select, obtain permissions, authorize • schedule, crawl w/ heritrix 1
  • 14. WCT cont’d • quality review • statistics, hierarchy visualization, pruning • troubleshooting • task notifications • reporting
  • 15. WCT cont’d
  • 16. NetarchiveSuite • free software from netarkivet.dk • used by State and University Library, The Royal Library in Denmark • complete solution from selection to access
  • 17. NetarchiveSuite cont’d
  • 18. NetarchiveSuite cont’d • selection, scoping, scheduling • crawling, troubleshooting, tweaking • system dashboard, quality assurance • heritrix and wayback
  • 19. warc-tools • command-line tools for arc/warc • validate, summarize, filter • bundle / rebundle, convert, index
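To give a feel for what "validate, summarize" on slide 19 involves, here is a minimal sketch that walks the records of an uncompressed WARC stream and counts them by `WARC-Type`. It does not use warc-tools or reproduce its command-line interface; the `iter_warc_headers` and `summarize` functions are hypothetical names written for this example. It relies only on the basic WARC record layout: a `WARC/` version line, named header fields, a blank line, then `Content-Length` bytes of content.

```python
from collections import Counter


def iter_warc_headers(stream):
    """Yield the header dict of each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            # Very light validation: every record must open with a version line.
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record's content block without interpreting it.
        stream.read(int(headers["Content-Length"]))
        yield headers


def summarize(stream):
    """Count records by their WARC-Type header."""
    return Counter(h.get("WARC-Type", "unknown") for h in iter_warc_headers(stream))
```

Real archives are usually gzip-compressed per record and may contain multi-line header values, which this sketch does not handle.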
  • 20. NutchWax • free software • index / search of ARC data • development slowed / stopped but still used
  • 21. searching web archives is hard
  • 22. wayback • free software from Internet Archive • public web access to web archives • what you’ve seen at archive.org
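The "what you've seen at archive.org" access pattern on slide 22 addresses a capture by a 14-digit timestamp followed by the original URL. The helper below sketches how such a replay URL is put together; `wayback_url` is a hypothetical function written for this example, and the default `base` assumes the public archive.org instance rather than a locally installed wayback.

```python
from datetime import datetime


def wayback_url(url, when, base="https://web.archive.org/web"):
    """Build a replay URL of the form <base>/<YYYYMMDDhhmmss>/<original url>."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{base}/{timestamp}/{url}"
```

For example, `wayback_url("http://example.org/", datetime(2013, 4, 2))` gives a URL whose timestamp part is `20130402000000`.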
  • 23. wayback cont’d
  • 24. wayback cont’d