WARCreate and WAIL:
WARC, Wayback and Heritrix Made Easy
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
{mkelly,mln,mweigle}@cs.odu.edu
Web Science and Digital Libraries Research Group
ws-dl.blogspot.com
The Problem
Institutional Tools, Personal Archivists
• ON YOUR MACHINE
– Complex to Operate
– Require Infrastructure
• DELEGATED TO INSTITUTIONS
– $$$
– Lose original perspective
• Locale content tailoring (DC vs. San Francisco)
• Observation Medium (PC web browser vs. crawler)
July 24, 2013
Arlington, Virginia Digital Preservation 2013
The Normal Solution
Ad Hoc Approaches
• Variable Output
• Deviate from standards (e.g., WARC)
• Swell for Saving A Copy
• Bad Practice for Preservation
Archive Facebook
Better Solution
• Adapt institutional tools & mediums
MAKING THE TOOLS SUITABLE
Web Archiving Integration Layer
(WAIL)
• Packages Wayback, Heritrix and other
preservation tools into a GUI
• Tools are pre-configured to work together
• “One Click User-Instigated Preservation”
Working with WAIL (Simple)
1. Enter URL
2. Click button
• Come back later
• Hit VIEW ARCHIVE
Working with WAIL (Custom)
• Enter multiple seed URLs (Heritrix tab)
• Customize crawl parameters
• Observe crawl state
• Get included tool info
• Get meta info on crawls
And More?
• Other preservation tools packaged
– (e.g., Archive Team’s WARC-Proxy)
• GUI is extensible to facilitate further
integration of other tools
– Currently working to package UKWA’s WARC-Explorer, UKWA’s monitrix, ODU/LANL’s mcurl, a custom memento proxy, etc.
PRESERVING IN
THE ORIGINAL CONTEXT
WARCreate
Create WARC files from any webpage
• Preserves what you see instead of what
crawler sees
– Capture pages behind authentication
– Manipulate then preserve
• No more preservation delegation
• Created WARCs compatible with WAIL and
Wayback instance
extension
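To make "create WARC files from any webpage" concrete, the following is a minimal sketch of what a single WARC response record looks like on disk. It is not WARCreate's code; the function name build_warc_response_record and the example URL are hypothetical, and only the general record layout (version line, named header fields, blank line, HTTP payload, two trailing CRLFs) is taken from the WARC format itself.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_response_record(target_uri, http_response, when=None):
    """Wrap raw HTTP response bytes in one WARC 'response' record.

    Hypothetical helper for illustration; `when` is assumed to be a
    timezone-aware datetime and defaults to the current UTC time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts the bytes of the record block only.
        ("Content-Length", str(len(http_response))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += ("%s: %s" % (name, value)).encode("utf-8") + CRLF
    # A blank line separates headers from the block; two CRLFs end the record.
    return head + CRLF + http_response + CRLF + CRLF


if __name__ == "__main__":
    payload = (
        b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
        b"<html>hi</html>"
    )
    record = build_warc_response_record("http://example.com/", payload)
    print(record.decode("utf-8"))
```

A real capture would usually also write warcinfo and request records and may compress each record; this sketch leaves those out to keep the record layout visible.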
Ad hoc to Generally Applicable
            Archive Facebook      WARCreate
App Type    Browser (Firefox)     Browser (Chrome)
Output      Navigable Webpages    Web ARChive (WARC) files
Target      Facebook.com          Any website
Working with WARCreate
• Browse as usual
• Preserve on a whim
• WARC output to your Downloads folder
Preserving the Original Context
Facebook-Supplied Data Dump
Archive created from
WARCreate in Wayback
Preserving the Original Context
Using Scraping Tools (e.g. wget)
Archive created from
WARCreate in Wayback
Preserving the Original Context
A Crawler Has No Context
Archive created from
WARCreate in Wayback
Preserving the Original Context
IA/HERITRIX OBEY ROBOTS
Archive created from
WARCreate in Wayback
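As a rough illustration of why a robots-obeying crawler can miss what a browser user sees, the sketch below checks URLs against a robots.txt policy using Python's standard urllib.robotparser. This is not how Heritrix is implemented; the policy text, the user-agent name "ExampleCrawler", the example URLs, and the helper crawler_may_fetch are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt policy, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if a robots-obeying crawler would be allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/profile.html",
    ):
        allowed = crawler_may_fetch(ROBOTS_TXT, "ExampleCrawler", url)
        print(url, "allowed" if allowed else "skipped by crawler")
```

A browser-based capture has no such check in its path: whatever the user is viewing is what gets preserved.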
Preserving Beyond the Surface Web
Creating a WARC of Your Twitter Feed
(Behind Authentication)
Tools’ History
June 2012: WARCreate presented at Joint Conference on Digital Libraries (JCDL) ’12
* required XAMPP, “local server”
July 2012: WARCreate presented at Digital Preservation 2012
* NDSA/NDIIPP award for Future Steward
February 2013: WARCreate decoupled from XAMPP, WAIL created, presented at Personal Digital Archiving 2013
May 2013: NEH grant begins to “Archive What I See Now”, port of WARCreate to Firefox & Much More
July 2013: WARCreate re-finalized, 1.0 released, presented at Digital Preservation 2013
Filling a Need
• Capable tools prevent ad hoc archiving
– Keep it familiar
• WARCreate as Chrome extension
– Or keep it native
• WAIL has respective OS look-and-feel
• Good archiving practice only begins with content capture; much remains to do
Available Now!
WARCreate: WARCreate.com
Web Archiving Integration Layer (WAIL): matkelly.com/wail
bit.ly/digpres2013


Editor's Notes

  • #2 Introduce. Am here to speak about some of our efforts in building tools for casual archivists hoping to preserve web pages.
  • #3 First, start by identifying the problem: digital preservation tools are ill suited for use by individual digital archivists. The tools of focus, Heritrix and Wayback, while FOSS, require technical know-how. To remedy this, individuals can delegate the task of digital preservation to institutions, but this poses many more problems. One we have investigated is variance in perspective, as exemplified by early crawls of Craigslist, which used GeoIP and thus attached the saved content to the San Francisco Craigslist. Perspective also varies with the tool used, i.e., what the crawler sees may not be the same as what we want preserved.
  • #4 Those who want to preserve resort to ad hoc techniques. These techniques produce archives that may not stand the test of time due to format issues and the bad-practice procedures used to save pages. Early work on Facebook preservation (Archive Facebook, AFB) tried to remedy this by making the process consistent, saving all pages in one’s FB profile, but it was limited in scope and frequently broke due to FB redesigns.
  • #5 We saw merit in putting preservation in the hands of those who decide what is important, but wanted something that was more general purpose (applicable to any webpage), used standard formats like WARC, and took advantage of the tools that have already been created.
  • #6 To conquer this last goal of adapting institutional tools to amateur archivists, we sought to adapt Heritrix, Wayback and other tools and make them more usable.
  • #7 Created WAIL: took institutional tools, configured for relativity, coded up a GUI to interact with the tools, and allowed crawls to be initiated and interacted with via the GUI. Made it easy: One Click User-Instigated Preservation.
  • #8 Simple workflow for a one-off crawl: enter URL, hit the Archive Now button, check back later.
  • #9 Allows further capability like services management, custom crawls, and crawl status checking. All still GUI-based.