Using Wayback Machine for Research

Presentation given at the Library of Congress on how to use Wayback Machine more effectively to answer historical research questions.


Usage Rights: CC Attribution-ShareAlike License

  • Mr. Peabody and Sherman’s time machine plot device from the television show “Rocky & Bullwinkle.”
  • The Wayback Machine most people are familiar with.
  • http://www.collectionscanada.gc.ca/webarchives/20071114183551/http://www.accord-treaty.gc.ca/main.asp?language=0
  • http://www.collectionscanada.gc.ca/webarchives/*/http://www.accord-treaty.gc.ca/main.asp?language=0
  • http://www.arquivo.pt/wayback/wayback/id4390263index3?l=en
  • http://was.nl.sg/wayback/20080404151626/http://www.biosingapore.org.sg/
  • http://was.nl.sg/wayback/*/http://www.biosingapore.org.sg/
  • http://www.padi.cat:8080/wayback/20120327044230/http://www.udg.edu/
  • http://www.padi.cat:8080/wayback/*/http://www.udg.edu/
  • http://webarchives.cdlib.org/sw16689n33/http://bawsca.org/
  • http://wax.lib.harvard.edu/collections/wayback.do?stamp=20080714184732&lang=eng&primColl=61&seed=175&liveWebUrl=tiffanni.blogspot.com%2F
  • When the Twitter link in the footer is clicked…
  • …the AJAX code truncates the URL, resulting in a blank page.
  • If you disable JavaScript in the browser and then click on the Twitter link, the page loads fine.
  • The navigation menu layout is awry and the links aren’t clickable.
  • Just because Wayback can’t properly rewrite the link doesn’t mean the crawler didn’t capture it. Navigate to the live site.
  • Find the desired URL.
  • Append the desired URL to the Wayback URL.
  • In the Library of Congress Web Archives, it’s only possible to search the bibliographic records.
  • The British Library and Internet Archive are exploring Lucene/Solr for full-text searching of web archives.
  • Note the live site URL.
  • Appending the live site URL to the Wayback URL takes you to a “snapshot” of that page in the archive.
  • Full date range is wildcarded (any date), so all snapshots for that URL are presented.
  • Date range is wildcarded to include only those captures from the specified year.
  • An individual page in the archive.
  • The time and specific resource are wildcarded, so it shows all resources captured for the specified domain on the specified day.
  • An example of one of the captured resources in the list.
  • Example of a live site.
  • Adjust the slider to request a Memento (i.e. archived resource) for the current URL.
  • We know that the website existed before then; how do we find it?
  • Copy the link to the IT Dashboard.
  • Additional captures from 2009 and 2010 are presented in the archive.
  • Additional captures from 2009 are presented in the archive.
  • The teleconference archives are in the events section.
  • If you click on any of the individual calls…
  • …you’re taken to an authentication page.
  • Even though the site URLs changed, there’s a decent chance that the teleconference archives were previously located in the events section.
  • Sure enough, they’re there, and not password-protected.
  • http://eotarchive.cdlib.org/2012.html
  • http://eotarchive.cdlib.org/search?browse-all=yes
  • http://govinfo.library.unt.edu/
  • http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
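The notes above describe appending a live-site URL to a Wayback URL and wildcarding the date. A minimal sketch of that URL construction follows, using the Web Archive Singapore addresses from the notes. The helper name `wayback_url` is hypothetical, and the year-prefix form (`2008*`) is an assumption about the wildcard syntax that may not hold for every Wayback deployment.

```python
def wayback_url(base, target, timestamp="*"):
    """Join a Wayback base URL, a timestamp (or wildcard), and a live-site URL."""
    return f"{base.rstrip('/')}/{timestamp}/{target}"


# Wayback endpoint and live-site URL taken from the notes above.
base = "http://was.nl.sg/wayback"
live = "http://www.biosingapore.org.sg/"

# A single snapshot: full 14-digit timestamp (YYYYMMDDHHMMSS).
snapshot = wayback_url(base, live, "20080404151626")

# Full date range wildcarded: lists every capture of the URL.
all_captures = wayback_url(base, live)

# Assumed syntax: a year prefix plus "*" limits the list to that year.
captures_2008 = wayback_url(base, live, "2008*")

print(snapshot)
print(all_captures)
print(captures_2008)
```

The first two results match the Web Archive Singapore addresses listed in the notes; the third only illustrates the year-limited form.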

Using Wayback Machine for Research: Presentation Transcript

  • Using Wayback Machine for Research Nicholas Taylor Repository Development Group
  • What Is the WAYBACK MACHINE?
  • WABAC Machine?
  • Internet Archive’s Wayback Machine
  • not one, but many Wayback Machines
      • open source software to “replay” web archives
          • rewrites links to point to archived resources
          • allows for temporal navigation within archive
      • used by many web archiving institutions
          • 33 out of 62 initiatives listed on Wikipedia
  • Government of Canada Web Archive
  • Government of Canada Web Archive
  • Portuguese Web Archive
  • Web Archive Singapore
  • Web Archive Singapore
  • Catalonian Web Archive
  • Catalonian Web Archive
  • California Digital Library Web Archiving Service
  • Harvard University Web Archive Collection Service
  • Common LIMITATIONS AND WORKAROUNDS
  • limitation: banner displaces page elements
  • workaround: hide the banner
  • limitation: AJAX-enabled sites
  • limitation: AJAX-enabled sites
  • workaround: disable JavaScript
  • limitation: nav menu link errors
  • workaround: insert live site URL in archive
  • workaround: insert live site URL in archive
  • workaround: insert live site URL in archive
  • limitation: no full-text search
  • workaround: none yet, but R&D ongoing
  • Basic MECHANICS
  • structure of a Wayback Machine URL: http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html
      • Wayback Machine URL: http://webarchiveqr.loc.gov/
      • collection: loc_sites
      • date/timestamp (YYYYMMDDHHMMSS): 20120131201510
      • URL of archived resource: http://www.loc.gov/index.html
  • URL-based access
  • URL-based access
  • date wildcarding
  • date wildcarding
  • document wildcarding
  • document wildcarding
  • document wildcarding
  • Strategies for FINDING MISSING RESOURCES
  • removed or moved?
      • don’t start with the archive
      • missing resources have often just moved (Klein & Nelson, 2010)
      • Synchronicity for Firefox helps find new location
          • scrapes archived version for “fingerprint” keywords; uses them to query search engines
  • MementoFox
  • MementoFox
  • find archives for a site whose URL has changed website URL changed recently historical URL is unknown solution: use search engine to find historical URL then apply it in the archive
  • Federal IT Dashboard
  • check Internet Archive’s Wayback Machine
  • IA Wayback coverage goes back to July 2010
  • LCWA only goes back to June 2011
  • use search engine to find historical URL
  • use search engine to find historical URL
  • White House IT Dashboard announcement
  • note the redirect from http://it.usaspending.gov/
  • append URL to IA Wayback URL
  • append URL to LC Wayback URL
  • find archives for a site whose URL has changed congressional committee hearings archive live site URL doesn’t work in archive solution: find a site in the archive that would link to the desired site, then navigate to contemporaneous snapshot
  • hearings archive only spans 2001-2006
  • hearings archive URL changed in 2011
  • truncate archival access URL
  • snapshot from prior to site change
  • navigate to appropriate section
  • navigate to appropriate section
  • find archives for a previously accessible webpage
      • records currently stored in password-protected part of site may have previously been publicly accessible
      • conceptual site organization lasts longer than exact link construction
      • solution: figure out where desired resource would be on the live site, then navigate to analogous section on archived site
  • location of resources on live site
  • location of resources on live site
  • authentication required
  • check the site in the archive
  • navigate to an individual capture
  • navigate to appropriate section
  • navigate to appropriate section
  • How You Can GET INVOLVED
  • help us to help you
      • what websites from today would you want to be able to consult in five, ten, twenty years’ time?
      • have you told us what is important to capture?
  • End of Term 2012 Web Archive
  • Other USEFUL RESOURCES
  • End of Term 2008 Web Archive
  • CyberCemetery
  • LCWA
  • Project One Web Archives
  • links Library of Congress Web Archiving Program: http://www.loc.gov/webarchiving/ Library of Congress Web Archives: http:// loc.gov/lcwa/ International Internet Preservation Consortium: http://netpreserve.org/ National Digital Information Infrastructure and Preservation Program: http:// www.digitalpreservation.gov/
  • questions?webcapture@loc.gov
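
As a supplement to the "structure of a Wayback Machine URL" slide in the transcript above, here is a minimal sketch that splits such a URL into its parts. The function name `parse_wayback_url` and the regular expression are illustrative assumptions, not part of the Wayback software. The sketch treats everything before the 14-digit timestamp (Wayback Machine URL plus collection) as a single prefix and handles only full timestamps, not wildcards.

```python
import re
from datetime import datetime

# Assumed layout: <prefix>/<YYYYMMDDHHMMSS>/<URL of archived resource>
WAYBACK_PATTERN = re.compile(
    r"^(?P<prefix>.+?)/(?P<timestamp>\d{14})/(?P<target>https?://.+)$"
)


def parse_wayback_url(url):
    """Return (prefix, capture datetime, archived resource URL)."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a timestamped Wayback URL: {url}")
    captured = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return match.group("prefix"), captured, match.group("target")


# Example URL taken from the slide.
prefix, captured, target = parse_wayback_url(
    "http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html"
)
print(prefix)    # Wayback Machine URL plus collection
print(captured)  # capture date and time
print(target)    # URL of the archived resource
```

Keeping the prefix in one piece makes it easy to reuse with a different timestamp or a different live-site URL, which is the workaround the slides recommend when rewritten links fail.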