Digital Preservation 2013

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy

  • Introduce myself. I am here to speak about some of our efforts in building tools for casual archivists hoping to preserve web pages.
  • First, start by identifying the problem: digital preservation tools are ill suited for use by individual digital archivists. The tools of focus, Heritrix and Wayback, while FOSS, require technical know-how. To remedy this, individuals can delegate the task of digital preservation to institutions, but this poses many more problems. One we have investigated is variance in perspective, as exemplified by early crawls of Craigslist, which used GeoIP and thus attached the saved content to the San Francisco Craigslist. Perspective also varies with the tool used, i.e., what the crawler sees may not be the same as what we want preserved.
  • Those who want to preserve resort to ad hoc techniques. These techniques produce archives that may not stand the test of time due to format issues and the bad-practice procedures used to save pages. Early work on Facebook preservation (Archive Facebook, AFB) tried to remedy this by making the process consistent, saving all pages in one’s Facebook profile, but it was limited in scope and frequently broke due to Facebook redesigns.
  • We saw merit in putting preservation in the hands of those who decide what is important, but wanted something that: is more general purpose, applicable to any webpage; uses standard formats like WARC; and takes advantage of the tools that have already been created.
  • To meet this last goal of adapting institutional tools for amateur archivists, we sought to adapt Heritrix, Wayback and other tools and make them more usable.
  • Created WAIL: took institutional tools, configured them to work with one another, and coded up a GUI to interact with the tools. Crawls can be initiated and interacted with via the GUI. Made it easy: One Click User-Instigated Preservation.
  • Simple workflow for a one-off crawl: enter the URL, hit the Archive Now button, and check back later.
  • Allows further capabilities such as services management, custom crawls, and crawl status checking, all still GUI-based.
  • Digital Preservation 2013

    1. WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy • Mat Kelly, Michael L. Nelson, Michele C. Weigle • Old Dominion University • {mkelly,mln,mweigle}@cs.odu.edu • Web Science and Digital Libraries Research Group • ws-dl.blogspot.com
    2. The Problem: Institutional Tools, Personal Archivists • ON YOUR MACHINE – Complex to Operate – Require Infrastructure • DELEGATED TO INSTITUTIONS – $$$ – Lose original perspective: locale content tailoring (DC vs. San Francisco); observation medium (PC web browser vs. crawler) • Presented July 24, 2013, Arlington, Virginia, Digital Preservation 2013
    3. The Normal Solution: Ad Hoc Approaches • Variable Output • Deviate from standards (e.g., WARC) • Swell for Saving A Copy • Bad Practice for Preservation • Example: Archive Facebook
    4. Better Solution • Adapt institutional tools & mediums
    5. MAKING THE TOOLS SUITABLE
    6. Web Archiving Integration Layer (WAIL) • Packages Wayback, Heritrix and other preservation tools into a GUI • Tools are pre-configured to work together • “One Click User-Instigated Preservation”
    7. Working with WAIL (Simple) • (1) Enter URL • (2) Click button • Come back later • Hit VIEW ARCHIVE
    8. Working with WAIL (Custom) • Enter multiple seed URLs (Heritrix tab) • Customize Crawl Parameters • Observe crawl state • Get included tool info • Get meta info on crawls
    9. And More? • Other preservation tools packaged – (e.g., Archive Team’s WARC-Proxy) • GUI is extensible to facilitate further integration of other tools – Currently working to package UKWA’s WARC-Explorer, UKWA’s monitrix, ODU/LANL’s mcurl, a custom memento proxy, etc.
    10. PRESERVING IN THE ORIGINAL CONTEXT
    11. WARCreate extension: Create WARC files from any webpage • Preserves what you see instead of what crawler sees – Capture pages behind authentication – Manipulate then preserve • No more preservation delegation • Created WARCs compatible with WAIL and Wayback instance
    12. Ad hoc to Generally Applicable
                    Archive Facebook      WARCreate
          App Type  Browser (Firefox)     Browser (Chrome)
          Output    Navigable Webpages    Web ARChive (WARC) files
          Target    Facebook.com          Any website
    13. Working with WARCreate • Browse as usual • Preserve on a whim • WARC output to your Downloads folder
    14. Preserving the Original Context • Facebook-Supplied Data Dump • Archive created from WARCreate in Wayback
    15. Preserving the Original Context • Using Scraping Tools (e.g. wget) • Archive created from WARCreate in Wayback
    16. Preserving the Original Context • A Crawler Has No Context • Archive created from WARCreate in Wayback
    17. Preserving the Original Context • IA/HERITRIX OBEY ROBOTS • Archive created from WARCreate in Wayback
    18. Preserving Beyond the Surface Web
    19. Creating a WARC of Your Twitter Feed (Behind Authentication)
    20. Tools’ History • June 2012: WARCreate presented at Joint Conference on Digital Libraries (JCDL) ’12 (required XAMPP, “local server”) • July 2012: WARCreate presented at Digital Preservation 2012 (NDSA/NDIIPP award for Future Steward) • February 2013: WARCreate decoupled from XAMPP, WAIL created, presented at Personal Digital Archiving 2013 • May 2013: NEH grant begins to “Archive What I See Now”, port of WARCreate to Firefox & Much More • July 2013: WARCreate re-finalized, 1.0 released, presented at Digital Preservation 2013
    21. Filling a Need • Capable tools prevent ad hoc archiving – Keep it familiar: WARCreate as Chrome extension – Or keep it native: WAIL has respective OS look-and-feel • Good archiving practices only begin with content capture; much remains to do
    22. Available Now! • WARCreate: WARCreate.com • Web Archiving Integration Layer (WAIL): matkelly.com/wail • bit.ly/digpres2013
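
Slide 11 describes WARCreate as writing WARC files from whatever page the browser is showing. To make the output format concrete, here is a minimal sketch in Python of how a single WARC `response` record can be assembled from an already-captured HTTP response. It is not WARCreate's code (WARCreate is a Chrome extension); the function name `build_warc_response_record` and the sample URL are assumptions made for illustration, and a complete WARC file would normally also carry a `warcinfo` record and the matching `request` records, which are omitted here.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

CRLF = b"\r\n"


def build_warc_response_record(
    target_uri: str,
    http_response: bytes,
    when: Optional[datetime] = None,
) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one WARC/1.0 record.

    Hypothetical helper for illustration only.
    `when` should be a timezone-aware datetime; it defaults to the current time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts only the record block, i.e. the HTTP response bytes.
        ("Content-Length", str(len(http_response))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates the WARC headers from the block;
    # two CRLFs terminate the record.
    return head + CRLF + http_response + CRLF + CRLF


if __name__ == "__main__":
    captured = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html><body>What I saw in my browser</body></html>"
    )
    record = build_warc_response_record("http://example.com/", captured)
    with open("example.warc", "wb") as out:
        out.write(record)
```

Because the record wraps the bytes the browser actually received, content behind authentication or tailored to the viewer's locale is stored as the person saw it, which is the point the slides make about preserving the original context.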

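
Slide 17 notes that IA/Heritrix obey robots exclusion rules, so a page a person can open in their own browser may never be captured by a crawler. The sketch below illustrates that gap using Python's standard `urllib.robotparser`; the robots.txt content, the user-agent string `example-crawler`, and the URLs are assumptions made up for the example, not taken from any real site or from Heritrix's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a site might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots-obeying crawler would be allowed to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/public/index.html",
        "http://example.com/private/feed.html",
    ):
        allowed = crawler_may_fetch(ROBOTS_TXT, "example-crawler", url)
        print(url, "-> crawlable" if allowed else "-> skipped by a polite crawler")
```

A browser-side tool is not subject to this check, which is one reason an archive made from the browser can contain pages that a crawl of the same site would leave out.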