Archive What I See Now - Archive-It Partner Meeting 2013

  1. Archive What I See Now
     Mat Kelly, Michael L. Nelson, Michele C. Weigle
     Old Dominion University
     {mkelly,mln,mweigle}@cs.odu.edu
     Web Science and Digital Libraries Research Group
     ws-dl.blogspot.com
  2. What’s the Problem?
     • Web archives capture a lot but not everything
     • Individuals’ interests may not be captured
     • Timely capture is important
     • Capture capability must be enabled for all
     November 12, 2013, Salt Lake City, Utah. 2013 Archive-It Partner Meeting
  3. Timely Capture Is Important
     Use Case: Capturing Breaking Stories
     • Calls for seed URIs are reactionary
     • Not quick enough for rapidly evolving events
  4. Timely Capture Is Important
     Use Case: Capturing Breaking Stories
     • Intermediate mementos missed
     • The story is incomplete
  5. Timely Capture Is Important
     Use Case: Capturing Breaking Stories
  6. Timely Capture Is Important
     Use Case: Capturing Breaking Stories
  7. The Amateur Archivist’s Approach to Just-In-Time Capture
     • Users take ad hoc approaches
       1. Screenshots of pages
       2. Other sub-optimal approaches
  8. Enabling The Amateur Web Archivist
     • Acknowledge the problem:
       – THE TOOLS ARE DIFFICULT!
     • Resolve the problem:
       – Build more accessible tools (make it EASY)
       – Appeal to standards (e.g., WARC, ISO 28500:2009)
       – Make interoperable
  9. The Institutional Dilemma
     • Safety of Archives Requires $
     • Institutions Require Funding
     • Users’ Hard Drives Fail
       – No Access to Save-As files and Screenshots
     • Hybrid approach needed
       – Leverage institutional safety, formats, and tech
       – Allow direct user deposits
  10. So we built it!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
      1. Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2012). Washington, DC, June 2012, pp. 437-438.
      2. Mat Kelly, Michele C. Weigle, Michael Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
  11. WARCreate – How it Works
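For readers unfamiliar with the WARC format mentioned above, the following is a minimal sketch of what a single WARC/1.0 `response` record looks like when built by hand. It is not WARCreate's code: the function name `build_warc_response_record` and the sample payload are assumptions made for illustration, and a file intended for replay in Wayback would normally also carry `warcinfo` and `request` records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP response (status line, headers, body) in one WARC/1.0 record.

    Hypothetical helper for illustration only; not taken from WARCreate.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: {record_id}",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts only the record block (the HTTP message bytes).
        f"Content-Length: {len(http_response)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLF pairs after the block.
    return header + http_response + b"\r\n\r\n"


# Sample payload (made up): a tiny HTTP response as the browser might have seen it.
sample_payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample_record = build_warc_response_record("http://example.com/", sample_payload)
```

Writing `sample_record` to a file ending in `.warc` gives a one-record archive; real tools add further records and usually compress each record with gzip.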
  12. Preserving the Original Context
      Use Case: Capturing Facebook
      Archive created from WARCreate in Wayback
      Facebook-Supplied Data Dump
      Liberated Data Doesn’t Give The Whole Picture
  13. Preserving the Original Context
      Use Case: Capturing Facebook
      Using Scraping Tools (e.g., wget)
      Archive created from WARCreate in Wayback
      The Target Controls What is Allowed
  14. Preserving the Original Context
      Use Case: Capturing Facebook
      Archive created from WARCreate in Wayback
      A Crawler Has No Context
      No Credentials → No Entry → No Archiving
  15. Preserving the Original Context
      Use Case: Capturing Facebook
      Archive created from WARCreate in Wayback
      IA/HERITRIX OBEY ROBOTS
      No Means No, if They Say and You Obey
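To make the robots point concrete, here is a small sketch of how a crawler that obeys the robots exclusion protocol decides whether it may fetch a page. It uses Python's standard `urllib.robotparser`; the robots.txt rules, the user-agent string, and the URL are made-up examples, not the actual rules of any site named in the slides.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that turns every crawler away (an assumption for illustration).
restrictive_rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(restrictive_rules)

# A polite crawler asks before fetching; "example-crawler" is a hypothetical agent name.
crawler_may_fetch = parser.can_fetch("example-crawler", "https://social.example/some-profile")

if not crawler_may_fetch:
    # An obedient crawler stops here, so nothing is archived.
    print("robots.txt disallows this URL; a compliant crawler skips it")
```

A browser extension sits on the other side of this check: it preserves what the logged-in user has already loaded, so no separate crawl request is made.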
  16. So we built it!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
  17. Users can now create WARCs!
      WARCreate – Google Chrome extension
      • Create web archives from browser
      • Capture personalized content
      • Preserve on a whim
      Users don’t know WHAT TO DO with WARC files
  18. So, again, we built it!
      Web Archiving Integration Layer (WAIL)
      • Heritrix, Wayback, etc. packaged for PC
      • GUI front-end allows “One-Click Preservation”
      • Provides means to replay WARCs
      1. Mat Kelly, Michele C. Weigle, Michael Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session; 2013 Feb 21; College Park, MD.
      2. Mat Kelly, Michael Nelson and Michele C. Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013, Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA.
  19. So, again, we built it!
      Web Archiving Integration Layer (WAIL)
      • Heritrix, Wayback, etc. packaged for PC
      • GUI front-end allows “One-Click Preservation”
      • Provides means to replay WARCs
  20. The Archive What I See Now Project
  21. The Archive What I See Now Project: Three Goals
      1. Port WARCreate to Firefox
      2. Add functionality in: … to upload WARCs to: [logo] & [logo] & [logo]
      3. Implement Sequential Archiving
  22. Porting WARCreate to Firefox
      • Disjoint extension/add-on APIs
        – Little logic can be re-used
      • Problems with HTTP header capture in Chrome are trivial in Firefox
        – Chrome = highly asynchronous fetching
      • Code to save WARC to PC from browser reusable in Firefox
  23. The Archive What I See Now Project: Three Goals
      1. Port WARCreate to Firefox (✓ In βeta now!)
      2. Add functionality in: … to upload WARCs to: [logo] & [logo] & [logo]
      3. Implement Sequential Archiving
  24. The Archive What I See Now Project: Three Goals
      1. Port WARCreate to Firefox
      2. Add functionality in: … to upload WARCs to: [logo] & [logo] & [logo]
      3. Implement Sequential Archiving
  25. Uploading WARCs: An Open Question
      • Working with Archive-It to determine feasibility of user-provided WARCs
      • Consideration of data integrity
      • Should data be merged with Archive-It-crawled WARCs?
        – How do we account for your www.facebook.com vs. my www.facebook.com?
      • Privacy?
  26. The Archive What I See Now Project: Three Goals
      1. Port WARCreate to Firefox
      2. Add functionality in: … to upload WARCs to: [logo] & [logo] & [logo]
      3. Implement Sequential Archiving
  27. Sequential Archiving?
      • Similar to a focused crawl but URIs defined on per-site basis to be comprehensive
        – Akin to [logo] but generalized
      • Implemented in WARCreate
      • Utilize per-site specification to keep tools from breaking★

      Concept            | [site logo] | [site logo] | [site logo]
      personal stream    | wall        | posts       | my tweets
      global stream      | news feed   | streams     | followees’ tweets
      multimedia-photos  | photos      | photos      | N/A
      multimedia-videos  | videos      | videos      | N/A
      photo collection   | albums      | N/A         | N/A
      posts              | notes       | N/A         | N/A
      friends            | friends     | circles     | following

      ★ Discovery & Scraping: The Information Retrieval Approach - versus - The Digital Libraries Approach
  28. Online Hierarchy Definition
      • Only (and optionally) applied on recognized sites
        – Scraping as fallback for establishing hierarchy
      • Not limited to social media
        – CNN.com, MSNBC.com, etc. have similar hierarchies
      • Lives online; tools refer to it and so are always up to date
      • Standardized spec* prototype is live online
      * M. Kelly, An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication, Aug 2012
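The per-site specification idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the format of the actual spec: the hostnames, URL paths, the `SITE_SPEC` structure, and the function `sequential_uris` are all hypothetical. Only the concept names (personal stream, global stream, friends, and so on) and the per-site labels come from the slides.

```python
from typing import Dict, List, Optional

# Hypothetical spec: generic concept -> site-specific path, None where the slide says N/A.
# Hosts and paths are invented for illustration.
SITE_SPEC: Dict[str, Dict[str, Optional[str]]] = {
    "site-a.example": {
        "personal stream": "/wall",
        "global stream": "/news-feed",
        "multimedia-photos": "/photos",
        "multimedia-videos": "/videos",
        "photo collection": "/albums",
        "posts": "/notes",
        "friends": "/friends",
    },
    "site-c.example": {
        "personal stream": "/my-tweets",
        "global stream": "/followees-tweets",
        "multimedia-photos": None,
        "multimedia-videos": None,
        "photo collection": None,
        "posts": None,
        "friends": "/following",
    },
}


def sequential_uris(host: str, spec: Dict[str, Dict[str, Optional[str]]] = SITE_SPEC) -> Optional[List[str]]:
    """Return the ordered URIs to archive for a recognized site, or None if unrecognized."""
    sections = spec.get(host)
    if sections is None:
        # Unrecognized site: the caller would fall back to scraping to find the hierarchy.
        return None
    # Skip concepts the site does not offer (N/A in the table).
    return [f"https://{host}{path}" for path in sections.values() if path is not None]


uris_for_site_c = sequential_uris("site-c.example")
uris_for_unknown = sequential_uris("unknown.example")
```

Keeping the mapping as data separate from the tool is the point of the design described above: when a site renames a section, only the online spec changes and the tool keeps working.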
  29. Summary
      • Firefox WARCreate in Beta
        – Chrome WARCreate Users Can Currently Archive What They See Now with [logo] & [logo]
      • Sequential Archiving Implemented in Chrome WARCreate, needs porting
      • Next Big Hurdle: Working with Archive-It on WARC upload logistics
  30. Archive What I See Now
      • Download Our Archiving Tools!
        – Web Archiving Integration Layer (WAIL): http://matkelly.com/WAIL
          One-Click Preservation. Heritrix, Wayback and Others On Your PC!
        – WARCreate for Chrome: http://WARCreate.com
          Create WARC files from any web page from your browser
          Firefox version in beta, Available Soon!
      • Share Your Use Cases for Capturing the Unpreserved and the Unpreservable
      • Help Us Improve Our Tools, Give Feedback! http://bit.ly/wc-wail