FARL web archiving

Published in: Technology, Design


  1. A survey of web-based art resources with findings applicable to FARL electronic records collection development
     Alison Rhonemus, LIS 698, Seminar and Practicum, Dr. Tula Giannini
     Frick Art Reference Library
     Deborah Kempe, Chief, Collections Management & Access
     Web Survey and Collection Development
     Coffee on the terrace
  2. M-LEAD-TWO
     Intern enterprises: "collection assessments, digital resource surveys, web archiving, provide support for important consortial programs such as shared resources"
     ● Brooklyn Museum: Mark Daly, Ronnette Hope; Project Manager: Emily Atwater
     ● NYARC Latin American Resources (MOMA): Ralph Baylor
     ● FARL: Gretchen Nadasky, Alison Rhonemus
  3. Frick Art Reference Library
     In early 2011, the Frick Art Reference Library and the Thomas J. Watson Library at The Metropolitan Museum of Art completed a pilot project to address coordinated collecting of born-digital auction catalogs using ContentDM and Archive-It.
  4. The FARL web archiving program is situated in Collection Development.
     Current plans for website capture include online auction catalogs and art web resources cataloged by NYARC.
     Fellow M-LEAD-TWO intern Gretchen Nadasky has just described online auction catalogs.
     My project focused on NYARC cataloged websites.
  5. Web Archiving
     "The Internet Archive is already doing it."
     Actually, the IA is providing the tools for other institutions to use in archiving.
  6. Archive-It
     Uses open source tools developed by the Internet Archive:
     ● Heritrix web crawler
     ● Wayback interface
     ● WARC format, an ISO standard
  7. Quality Assurance
     ● The report and manual checks
     ● Partner and Wayback interface
  8. Evaluating seeds
     • Password-protected sites: cannot be archived.
     • JavaScript: more complicated implementations can be difficult to capture and display. Ongoing area of development.
     • Videos: difficulty with some proprietary formats.
     • Form- and database-driven content: may be archived using a sitemap or other direct links to the content.
  9. Robots.txt Blocks
     The crawler by default respects all robots.txt files. Check post-crawl reports for blocked seeds or documents.
     If your site is blocked:
     a) Contact the site owner and ask if they will unblock it
     b) Ask your Partner Specialist to turn on the "ignore robots" feature in your account
     Notes:
     / denotes single directory seed
     subdomains.archive.org (add individually or expand seed)
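A robots.txt block like the one described on slide 9 can be spotted before a crawl is scheduled. The following is a minimal sketch using Python's standard `urllib.robotparser`, not part of the Archive-It tooling. The user agent string `archive.org_bot`, the sample robots.txt, and the `example.org` seeds are all assumptions made for illustration; confirm the actual crawler user agent with your Partner Specialist.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live fetch.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def blocked_seeds(robots_txt, seeds, user_agent="archive.org_bot"):
    """Return the seeds that the given robots.txt disallows for user_agent.

    The default user agent is an assumption, not a documented Archive-It value.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [seed for seed in seeds if not parser.can_fetch(user_agent, seed)]


if __name__ == "__main__":
    candidate_seeds = [
        "https://example.org/collection/",
        "https://example.org/private/catalog.pdf",
    ]
    for seed in blocked_seeds(SAMPLE_ROBOTS_TXT, candidate_seeds):
        print("blocked by robots.txt:", seed)
```

A seed flagged this way would still need one of the two remedies listed on the slide: asking the site owner to unblock it, or asking for the "ignore robots" feature.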
  10. Site Survey Criteria
      ● html/flash/pdf
      ● images
      ● embedded material
      ● links
      ● directories and subdomains
      ● terms, rights statements and permissions
  11. Obvious ruse
  12. More of the obvious
      Sites created without the intention of being archived are the sites in need of archiving.
  13. Survey Says
      ● 257 cataloged entries
      ● 168 resources are possible to capture
      ● 82 resources would require more research or display definite red flags for web archiving
      ● PDFs are available for at least some of the content in 75 resources
      ● Flash was an element in 23 resources
      ● 16 sites used HTML5
      ● 54 used a CMS like Drupal or WordPress
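The counts on slide 13 are easier to compare as shares of the 257 cataloged entries. The short sketch below only restates the slide's figures as percentages; the dictionary labels are shorthand introduced here, and the categories are not mutually exclusive in the source (a site can use both Flash and a CMS, for example).

```python
TOTAL_ENTRIES = 257  # cataloged entries reported on the slide

# Counts copied from the slide; the keys are shorthand labels, not the author's terms.
SURVEY_COUNTS = {
    "possible to capture": 168,
    "more research or red flags": 82,
    "PDF for some content": 75,
    "Flash element": 23,
    "HTML5": 16,
    "CMS (Drupal, WordPress, ...)": 54,
}


def share(count, total=TOTAL_ENTRIES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    for label, count in SURVEY_COUNTS.items():
        print(f"{label:30s} {count:4d} {share(count):5.1f}%")
```

By this arithmetic roughly 65% of the cataloged resources look capturable and roughly 32% need more research or raise red flags. The two figures sum to 250, and the slide does not say how the remaining 7 entries were classified.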
  14. There were 3 cataloged resources no longer available on the live web but viewable through Internet Archive.
      Another 2 defunct resources were not available through Internet Archive.
      The main page for one of these lost resources was available as a snapshot in Wayback, but the actual cataloged resource was not available.
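A check like the one behind slide 14 (is a cataloged resource viewable through the Internet Archive?) can be partly automated. This is a hedged sketch, not the method used in the survey: it assumes the Internet Archive's public Wayback availability endpoint at `https://archive.org/wayback/available` and its `archived_snapshots` / `closest` JSON response shape, both of which should be verified against current Internet Archive documentation. The function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public Wayback availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(resource_url: str) -> str:
    """Build the query URL for one cataloged resource."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': resource_url})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(resource_url: str) -> Optional[str]:
    """Query the service over the network and return a snapshot URL, or None."""
    with urlopen(availability_url(resource_url), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

As the slide notes, a snapshot of a site's main page does not mean the cataloged resource itself was captured, so the lookup should be run on the exact cataloged URL and any hit still checked by eye.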
  15. Change is Constant
      Archive-It updates:
      ● Heritrix 1 series to Heritrix 3 series (February)
      ● Archive-It 4.8 (May)
  16. Archive-It 4.8
  17. Plans
      ● Upcoming grants
      ● Capture of NYARC institution websites
      ● Include Wayback interface links in Arcade catalog records
      ● Continue to identify websites for capture and implement capture
  18. Conclusions
      ○ Digital resources are not prevalent enough to reassign current staff
      ○ Website capture is the most costly in terms of staff time
      ○ Copyright continues to be an issue
      ○ Long-term digital preservation needs have yet to be assessed
      ○ Capture of the Frick Collection and NYARC sites will pose a challenging test case
