Archive What I See Now
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
{mkelly,mln,mweigle}@cs.odu.edu
Archive What I See Now - Archive-It Partner Meeting 2013

  1. Archive What I See Now
      Mat Kelly, Michael L. Nelson, Michele C. Weigle
      Old Dominion University
      {mkelly,mln,mweigle}@cs.odu.edu
      Web Science and Digital Libraries Research Group
      ws-dl.blogspot.com
  2. What’s the Problem?
      • Web archives capture a lot but not everything
      • Individuals’ interests may not be captured
      • Timely capture is important
      • Capture capability must be enabled for all
      November 12, 2013, Salt Lake City, Utah
      2013 Archive-It Partner Meeting
  3. Timely Capture Is Important
      Use Case: Capturing Breaking Stories
      • Calls for seed URIs are reactive
      • Not quick enough for rapidly evolving events
  4. Timely Capture Is Important
      Use Case: Capturing Breaking Stories
      • Intermediate mementos missed
      • The story is incomplete
  5. Timely Capture Is Important
      Use Case: Capturing Breaking Stories
  6. Timely Capture Is Important
      Use Case: Capturing Breaking Stories
  7. The Amateur Archivist’s Approach to Just-In-Time Capture
      • Users take ad hoc approaches
        1. Screenshots of Pages
        2. Other sub-optimal approaches
  8. Enabling The Amateur Web Archivist
      • Acknowledge the problem:
        – THE TOOLS ARE DIFFICULT!
      • Resolve the problem:
        – Build more accessible tools (make it EASY)
        – Appeal to standards (e.g., WARC, ISO 28500:2009)
        – Make interoperable
  9. The Institutional Dilemma
      • Safety of Archives Requires $
      • Institutions Require Funding
      • Users’ Hard Drives Fail
        – No Access to Save-As files and Screenshots
      • Hybrid approach needed
        – Leverage institutional safety, formats, and tech
        – Allow direct user deposits
  10. So we built it!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
      1. Mat Kelly and Michele C. Weigle. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2012), Washington, DC, June 2012, pp. 437-438.
      2. Mat Kelly, Michele C. Weigle, Michael Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
  11. WARCreate – How it Works
  12. Preserving the Original Context
      Use Case: Capturing Facebook
      • Archive created from WARCreate in Wayback
      • Facebook-Supplied Data Dump
      Liberated Data Doesn’t Give The Whole Picture
  13. Preserving the Original Context
      Use Case: Capturing Facebook
      • Using Scraping Tools (e.g. wget)
      • Archive created from WARCreate in Wayback
      The Target Controls What is Allowed
  14. Preserving the Original Context
      Use Case: Capturing Facebook
      • Archive created from WARCreate in Wayback
      A Crawler Has No Context
      No Credentials → No Entry → No Archiving
  15. Preserving the Original Context
      Use Case: Capturing Facebook
      • Archive created from WARCreate in Wayback
      IA/HERITRIX OBEY ROBOTS
      No Means No, if They Say and you Obey
  16. So we built it!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
  17. Users can now create WARCs!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
      Users don’t know WHAT TO DO with WARC files
  18. So, again, we built it!
      Web Archiving Integration Layer (WAIL)
      • Heritrix, Wayback, etc. packaged for PC
      • GUI front-end allows “One-Click Preservation”
      • Provides means to replay WARCs
      1. Mat Kelly, Michele C. Weigle, Michael Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session; 2013 Feb 21; College Park, MD.
      2. Mat Kelly, Michael Nelson and Michele C. Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013, Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA.
  19. So, again, we built it!
      Web Archiving Integration Layer (WAIL)
      • Heritrix, Wayback, etc. packaged for PC
      • GUI front-end allows “One-Click Preservation”
      • Provides means to replay WARCs
  20. The Archive What I See Now Project
  21. The Archive What I See Now Project: Three Goals
      1. Port
      2. Add functionality in: … to upload WARCs to: & &
      3. Implement Sequential Archiving
  22. Porting WARCreate to Firefox
      • Disjoint extension/add-on APIs
        – Little logic can be re-used
      • HTTP header capture, which is problematic in Chrome, is trivial in Firefox
        – Chrome = highly asynchronous fetching
      • Code to save a WARC to the PC from the browser is reusable in Firefox
  23. The Archive What I See Now Project: Three Goals
      1. Port (✓ In βeta now!)
      2. Add functionality in: … to upload WARCs to: & &
      3. Implement Sequential Archiving
  24. The Archive What I See Now Project: Three Goals
      1. Port
      2. Add functionality in: … to upload WARCs to: & &
      3. Implement Sequential Archiving
  25. Uploading WARCs: An Open Question
      • Working with Archive-It to determine feasibility of user-provided WARCs
      • Consideration of data integrity
      • Should data be merged with WARCs crawled by Archive-It?
        – How do we account for your www.facebook.com vs. my www.facebook.com?
      • Privacy?
  26. The Archive What I See Now Project: Three Goals
      1. Port
      2. Add functionality in: … to upload WARCs to: & &
      3. Implement Sequential Archiving
  27. Sequential Archiving?
      • Similar to a focused crawl but URIs defined on per-site basis to be comprehensive
        – Akin to [logo] but generalized
      • Implemented into WARCreate
      • Utilize per-site specification to keep tools from breaking★

      Concept              [site logo]   [site logo]   [site logo]
      personal stream      wall          posts         my tweets
      global stream        news feed     streams       followees’ tweets
      multimedia-photos    photos        photos        N/A
      multimedia-videos    videos        videos        N/A
      photo collection     albums        N/A           N/A
      posts                notes         N/A           N/A
      friends              friends       circles       following

      ★ Discovery & Scraping: The Information Retrieval Approach - versus The Digital Libraries Approach
  28. Online Hierarchy Definition
      • Only (and optionally) applied on recognized sites
        – Scraping as fallback for establishing hierarchy
      • Not limited to social media
        – CNN.com, MSNBC.com, etc. have similar hierarchies
      • The definition lives online; tools refer to it, so they are always up to date
      • Standardized spec* prototype is live online
      * M. Kelly, An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication, Aug 2012
  29. Summary
      • Firefox WARCreate in Beta
        – Chrome WARCreate Users Can Currently Archive What They See Now with &
      • Sequential Archiving implemented in Chrome WARCreate, needs porting
      • Next Big Hurdle: Working with Archive-It on WARC upload logistics
  30. Archive What I See Now
      • Download Our Archiving Tools!
        – Web Archiving Integration Layer (WAIL): http://matkelly.com/WAIL
          One-Click Preservation. Heritrix, Wayback and Others On Your PC!
        – WARCreate for Chrome: http://WARCreate.com
          Create WARC files from any web page from your browser
          Firefox version in beta, Available Soon!
      • Share Your Use Cases for Capturing the Unpreserved and the Unpreservable
      • Help Us Improve Our Tools, Give Feedback! http://bit.ly/wc-wail