Future of web archiving
Usage Rights

CC Attribution License

  • Checklist, https://www.flickr.com/photos/adesigna/4090782772
  • First of all, why is web archiving important?
    For us as members of memory institutions, it is the continuation, in a new technological context, of our longstanding mission and obligation to collect, preserve, and provide access to the scholarly record and our collective cultural heritage.
    Since the web is where the content is, that is where we have to go to acquire it.
    But the fundamental problem is that the web is not the web anymore.
    As soon as you think you have quantified or characterized it, it has changed into something else; and as soon as you have processes in place to capture web content, the content is no longer available in the same way.

    What a tangled web we weave, https://www.flickr.com/photos/alaig/3522953697
    Thorsten Hartmann, Untitles, https://www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
  • It’s different from what anyone – Tim Berners-Lee included – had in mind 25 years ago.
    The web is no longer a giant document retrieval system, but a programming environment.
    The browser is no longer a document viewer, but a general-purpose virtual machine; its fundamental language is no longer HTML but JavaScript.
    The mode of experience has shifted from a common one to a highly personalized one; whose web are we archiving?

    Crumbled paper, https://www.flickr.com/photos/84564583@N08/11167321155
    The great pyramid: Size matters, https://www.flickr.com/photos/swamibu/2223726960
    A pile of rocks, https://www.flickr.com/photos/sharples/79222765
  • Paywalls, robot exclusions, crawler traps, … What we need is a collection mechanism that acts like a person.
    Search is simple if you know the URL.
    Ben Husmann, The FREE HUGS robot says “I am here for you”, https://www.flickr.com/photos/benhusmann/5126030385
  • Event-driven content doesn’t mesh well with established – meaning v-e-r-y deliberate – collection development processes.
    Hossam el-Hamalawy, Tahrir Square, https://www.flickr.com/photos/elhamalawy/6378330927
  • U Can’t Touch This, https://www.flickr.com/photos/vblibrary/7414544704
  • Dan Storey, Square peg in a round hole, https://www.flickr.com/photos/21664580@N04/2095574414
  • Silos, https://www.flickr.com/photos/54159370@N08/7148880783
  • How to find enough good people? (We’re hiring!)
  • “You’re collecting that?”
    May need programmatic or API access for in situ collection analysis.
  • Headless browsers (PhantomJS, Umbra, etc.), API harvesters
    Make browsing the past web as simple and intuitive as browsing the live web.
    Net casting at disk Contarf Pelican Park, https://www.flickr.com/photos/shebalso/6357626617
    Bart van de Biezen, Goed Zoekveld, https://www.flickr.com/photos/bartelomeus/4184705426
  • Avoid needless duplication of effort.
    As librarians we have historically given perhaps inordinate priority to content creators and curators and not enough to consumers. But over significant timespans it is the users who affirmatively seek out and exploit content who may be best positioned to contribute towards its successful management.

    Meyer lemons, https://www.flickr.com/photos/chiotsrun/4115059294
    We sit in the shade and drink lemonade, https://www.flickr.com/photos/sagesolar/9230445157
  • Michael Harries, Drawing back the curtain, http://cdn.ws.citrix.com/wp-content/uploads/2012/05/iStock_000010348904XSmall.jpg
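The notes above list robot exclusions among the obstacles to capture. As a minimal sketch of what that means for a crawler, the following uses Python's standard `urllib.robotparser` to filter candidate URLs against a robots.txt file. The robots.txt content, the `example-archive-bot` user agent, the `example.org` URLs, and the `allowed_urls` helper are all assumptions made up for illustration; none of them come from the presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of urls that robots_txt lets user_agent fetch."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical candidate URLs for a crawl.
candidates = [
    "https://example.org/index.html",
    "https://example.org/private/report.html",
]
harvestable = allowed_urls(ROBOTS_TXT, "example-archive-bot", candidates)
```

A crawler that honors the exclusion never sees anything under `/private/`, which is one way archived copies end up incomplete compared with what a person browsing the site could see.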

Future of web archiving Presentation Transcript

  • 1. Future of Web Archiving
    Stephen Abrams, California Digital Library
    Martin Klein, Los Alamos National Laboratory
    Jimmy Lin, University of Maryland
    Michael Nelson, Old Dominion University
    Digital Preservation 2014, Washington, July 22-24
  • 2. Agenda
    ▪ Web archiving problems and opportunities
    ▪ Memento tools
    ▪ WarcBase platform
    ▪ Assessing quality of archives
    ▪ Discussion
    www.flickr.com/photos/adesigna/4090782772
  • 3. Web archiving is important but (really) hard
    ▪ Why web archiving?
      Continuation of longstanding mission to collect, preserve, and provide access to the scholarly record and our cultural heritage
      Publishing/dissemination platform of choice
    ▪ But … the web isn’t the web anymore
    www.flickr.com/photos/alaig/3522953697
    www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
  • 4. Web in transition
    Document retrieval → Programming environment
    Document viewer → Virtual machine
    HTML → JavaScript
    Common → Personalized
    Desktop → Mobile/handheld/wearable
    Information → Things
    “A ‘web’ of notes with links (like references) between them …” – Tim Berners-Lee, March 1989
    www.flickr.com/photos/swamibu/2223726960
    www.flickr.com/photos/sharples/79222765
  • 5. (Some) other issues
    ▪ Crawlers don’t act like browsers
      ► Need robots that act more like people
    www.flickr.com/photos/benhusmann/5126030385
  • 6. (Some) other issues
    ▪ Crawlers don’t act like browsers
    ▪ Responsiveness to time-sensitive content
      ► Need to bypass v-e-r-y deliberate collection development procedures
    Guardian News and Media Limited
  • 7. (Some) other issues
    ▪ Crawlers don’t act like browsers
    ▪ Responsiveness to time-sensitive content
    ▪ Policies, rights, and permissions
      ► Need to overcome legal barriers that follow the monetization of content
    www.flickr.com/photos/vblibrary/7414544704
  • 8. (Some) other issues
    ▪ Crawlers don’t act like browsers
    ▪ Responsiveness to time-sensitive content
    ▪ Policies, rights, and permissions
    ▪ Difficult integration into traditional management and discovery services
      ► Leading to …
    www.flickr.com/photos/21664580@N04/2095574414
  • 9. (Some) other issues
    ▪ Crawlers don’t act like browsers
    ▪ Responsiveness to time-sensitive content
    ▪ Policies, rights, and permissions
    ▪ Difficult integration into traditional management and discovery services
    ▪ Siloed collections
    www.flickr.com/photos/54159370@N08/7148880783
  • 10. (Some) other issues
    ▪ Crawlers don’t act like browsers
    ▪ Responsiveness to time-sensitive content
    ▪ Policies, rights, and permissions
    ▪ Difficult integration into traditional management and discovery services
    ▪ Siloed collections
    ▪ Scale
      ► Storage capacity
      ► Full-text indexing
      ► De-duplication
      ► Resources
    Raiders of the Lost Ark © Paramount Pictures
  • 11. Supporting research
    ▪ Little awareness in the scholarly community
    ▪ Poorly understood use cases
    ▪ Few tools
    ▪ Traditional find→download→manipulate locally workflows may not be feasible at web scale
      ► Need APIs and business models for in situ analysis
    berkeley.edu/teach
    www.flickr.com/photos/infocux/8450190120
  • 12. Technological opportunities
    ▪ Better capture mechanisms
      ► Headless browsers
      ► API harvesters …
    ▪ Better discovery modalities
      ► Browsing the past should be as simple and intuitive as the now …
    www.flickr.com/photos/bartelomeus/4184705426
    www.flickr.com/photos/shebalso/6357626617
  • 13. Cooperative opportunities
    ▪ Complementary collection development
    ▪ Coordinated infrastructure support and operation
      ► Or perhaps centralized – a HathiTrust for web archives?
    ▪ Crowd sourcing selection, description, quality assurance
    www.flickr.com/photos/chiotsrun/4115059294
    www.flickr.com/photos/sagesolar/9230445157
  • 14. And now …
    cdn.ws.citrix.com/wp-content/uploads/2012/05/iStock_000010348904XSmall.jpg
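The slides name de-duplication as one of the scale problems without saying how it might be done. One common approach, sketched here under stated assumptions and not taken from the presentation, is to hash each captured payload and store the bytes only the first time a given digest is seen. The `deduplicate` helper, the choice of SHA-256, and the sample captures are all hypothetical.

```python
import hashlib


def deduplicate(captures):
    """Split (url, payload) pairs into first-time and repeated content.

    Returns (stored, duplicates): stored holds (url, digest) for payloads
    seen for the first time; duplicates holds (url, original_url) for
    payloads whose bytes were already stored under original_url.
    """
    first_seen = {}  # digest -> URL under which the payload was first stored
    stored, duplicates = [], []
    for url, payload in captures:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in first_seen:
            duplicates.append((url, first_seen[digest]))
        else:
            first_seen[digest] = url
            stored.append((url, digest))
    return stored, duplicates


# Hypothetical captures: two URLs serve byte-identical content.
sample = [
    ("https://example.org/a", b"<html>same page</html>"),
    ("https://example.org/b", b"<html>other page</html>"),
    ("https://example.org/a?ref=home", b"<html>same page</html>"),
]
stored, duplicates = deduplicate(sample)
```

This only catches exact byte-for-byte repeats. Pages that differ by a timestamp or a personalized fragment would hash differently, so this sketch does not address that case.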