Using Wayback Machine for Research

Presentation given at the Library of Congress on how to use Wayback Machine more effectively to answer historical research questions.

Published in: Technology

  • Mr. Peabody and Sherman’s time machine plot device from the television show “Rocky & Bullwinkle.”
  • The Wayback Machine most people are familiar with.
  • http://www.collectionscanada.gc.ca/webarchives/20071114183551/http://www.accord-treaty.gc.ca/main.asp?language=0
  • http://www.collectionscanada.gc.ca/webarchives/*/http://www.accord-treaty.gc.ca/main.asp?language=0
  • http://www.arquivo.pt/wayback/wayback/id4390263index3?l=en
  • http://was.nl.sg/wayback/20080404151626/http://www.biosingapore.org.sg/
  • http://was.nl.sg/wayback/*/http://www.biosingapore.org.sg/
  • http://www.padi.cat:8080/wayback/20120327044230/http://www.udg.edu/
  • http://www.padi.cat:8080/wayback/*/http://www.udg.edu/
  • http://webarchives.cdlib.org/sw16689n33/http://bawsca.org/
  • http://wax.lib.harvard.edu/collections/wayback.do?stamp=20080714184732&lang=eng&primColl=61&seed=175&liveWebUrl=tiffanni.blogspot.com%2F
  • When the Twitter link in the footer is clicked…
  • …the AJAX code truncates the URL, resulting in a blank page.
  • If you disable JavaScript in the browser and then click on the Twitter link, the page loads fine.
  • The navigation menu layout is awry and the links aren’t clickable.
  • Just because Wayback can’t properly rewrite the link doesn’t mean the crawler didn’t capture it. Navigate to the live site.
  • Find the desired URL.
  • Append the desired URL to the Wayback URL.
  • In the Library of Congress Web Archives, it’s only possible to search the bibliographic records.
  • The British Library and Internet Archive are exploring Lucene/Solr for full-text searching of web archives.
  • Note the live site URL.
  • Appending the live site URL to the Wayback URL takes you to a “snapshot” of that page in the archive.
  • Full date range is wildcarded (any date), so all snapshots for that URL are presented.
  • Date range is wildcarded to include only those captures from the specified year.
  • An individual page in the archive.
  • The time and specific resource are wildcarded, so it shows all resources captured for the specified domain on the specified day.
  • An example of one of the captured resources in the list.
  • Example of a live site.
  • Adjust the slider to request a Memento (i.e. archived resource) for the current URL.
  • We know that the website existed before then; how do we find it?
  • Copy the link to the IT Dashboard.
  • Additional captures from 2009 and 2010 are presented in the archive.
  • Additional captures from 2009 are presented in the archive.
  • The teleconference archives are in the events section.
  • If you click on any of the individual calls…
  • …you’re taken to an authentication page.
  • Even though the site URLs changed, there’s a decent chance that the teleconference archives were previously located in the events section.
  • Sure enough, they’re there, and not password-protected.
  • http://eotarchive.cdlib.org/2012.html
  • http://eotarchive.cdlib.org/search?browse-all=yes
  • http://govinfo.library.unt.edu/
  • http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
  • Using Wayback Machine for Research
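The notes above describe building archive URLs by hand: append the live site URL to the Wayback URL, and wildcard the date or the document to list captures. A minimal sketch of that URL construction follows. The function names are hypothetical. The `*/` form for "any date" appears in the example URLs in the notes; the partial-timestamp form (`2012*`) and the trailing `/*` for document wildcarding are assumptions about how a given Wayback deployment accepts wildcards, and may not work on every archive.

```python
def wayback_url(base: str, target: str, timestamp: str = "*") -> str:
    """Join a Wayback access point, a timestamp (or "*"), and a live site URL.

    With the default timestamp "*", the full date range is wildcarded,
    so the archive lists every snapshot it holds for the target URL.
    """
    return f"{base.rstrip('/')}/{timestamp}/{target}"


def captures_in_year(base: str, target: str, year: int) -> str:
    """Wildcard the date so only captures from one year are listed.

    Assumption: the archive accepts a partial timestamp followed by "*".
    """
    return wayback_url(base, target, f"{year}*")


def resources_on_day(base: str, domain_url: str, day: str) -> str:
    """Wildcard the time and the document for one domain on one day.

    Assumption: `day` is YYYYMMDD, and the archive accepts a trailing "/*"
    on the target to mean "any resource under this domain".
    """
    return wayback_url(base, domain_url.rstrip("/") + "/*", f"{day}*")


# Usage, with access points and URLs taken from the slides:
snapshot = wayback_url(
    "http://was.nl.sg/wayback/",
    "http://www.biosingapore.org.sg/",
    "20080404151626",
)
all_captures = wayback_url(
    "http://was.nl.sg/wayback/", "http://www.biosingapore.org.sg/"
)
```

The same helper covers the "insert live site URL in archive" workaround: when a rewritten link fails in the archive, pass the link's live URL as `target` and request it directly.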

    1. Nicholas Taylor, Repository Development Group: Using Wayback Machine for Research
    2. What Is the WAYBACK MACHINE?
    3. WABAC Machine?
    4. Internet Archive’s Wayback Machine
    5. not one, but many Wayback Machines: open source software to “replay” web archives; rewrites links to point to archived resources; allows for temporal navigation within archive; used by many web archiving institutions; 33 out of 62 initiatives listed on Wikipedia
    6. Government of Canada Web Archive
    7. Government of Canada Web Archive
    8. Portuguese Web Archive
    9. Web Archive Singapore
    10. Web Archive Singapore
    11. Catalonian Web Archive
    12. Catalonian Web Archive
    13. California Digital Library Web Archiving Service
    14. Harvard University Web Archive Collection Service
    15. Common LIMITATIONS AND WORKAROUNDS
    16. limitation: banner displaces page elements
    17. workaround: hide the banner
    18. limitation: AJAX-enabled sites
    19. limitation: AJAX-enabled sites
    20. workaround: disable JavaScript
    21. limitation: nav menu link errors
    22. workaround: insert live site URL in archive
    23. workaround: insert live site URL in archive
    24. workaround: insert live site URL in archive
    25. limitation: no full-text search
    26. workaround: none yet, but R&D ongoing
    27. Basic MECHANICS
    28. structure of a Wayback Machine URL: http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html consists of the Wayback Machine URL (http://webarchiveqr.loc.gov/), the collection (loc_sites), the date/timestamp in YYYYMMDDHHMMSS form (20120131201510), and the URL of the archived resource (http://www.loc.gov/index.html)
    29. URL-based access
    30. URL-based access
    31. date wildcarding
    32. date wildcarding
    33. document wildcarding
    34. document wildcarding
    35. document wildcarding
    36. Strategies for FINDING MISSING RESOURCES
    37. removed or moved? don’t start with the archive; missing resources have often just moved (Klein & Nelson, 2010); Synchronicity for Firefox helps find new location; scrapes archived version for “fingerprint” keywords, then uses them to query search engines
    38. MementoFox
    39. MementoFox
    40. find archives for a site whose URL has changed: website URL changed recently; historical URL is unknown; solution: use search engine to find historical URL, then apply it in the archive
    41. Federal IT Dashboard
    42. check Internet Archive’s Wayback Machine
    43. IA Wayback coverage goes back to July 2010
    44. LCWA only goes back to June 2011
    45. use search engine to find historical URL
    46. use search engine to find historical URL
    47. White House IT Dashboard announcement
    48. note the redirect from http://it.usaspending.gov/
    49. append URL to IA Wayback URL
    50. append URL to LC Wayback URL
    51. find archives for a site whose URL has changed: congressional committee hearings archive; live site URL doesn’t work in archive; solution: find a site in the archive that would link to the desired site, then navigate to contemporaneous snapshot
    52. hearings archive only spans 2001-2006
    53. hearings archive URL changed in 2011
    54. truncate archival access URL
    55. snapshot from prior to site change
    56. navigate to appropriate section
    57. navigate to appropriate section
    58. find archives for a previously accessible webpage: records currently stored in password-protected part of site may have previously been publicly accessible; conceptual site organization lasts longer than exact link construction; solution: figure out where desired resource would be on the live site, then navigate to analogous section on archived site
    59. location of resources on live site
    60. location of resources on live site
    61. authentication required
    62. check the site in the archive
    63. navigate to an individual capture
    64. navigate to appropriate section
    65. navigate to appropriate section
    66. How You Can GET INVOLVED
    67. what websites from today would you want to be able to consult in five, ten, twenty years’ time? have you told us what is important to capture? help us to help you
    68. End of Term 2012 Web Archive
    69. Other USEFUL RESOURCES
    70. End of Term 2008 Web Archive
    71. CyberCemetery
    72. LCWA
    73. Project One Web Archives
    74. links: Library of Congress Web Archiving Program: http://www.loc.gov/webarchiving/; Library of Congress Web Archives: http://loc.gov/lcwa/; International Internet Preservation Consortium: http://netpreserve.org/; National Digital Information Infrastructure and Preservation Program: http://www.digitalpreservation.gov/
    75. questions? webcapture@loc.gov
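
The slide on the structure of a Wayback Machine URL breaks an access URL into the Wayback access point (including any collection), a 14-digit YYYYMMDDHHMMSS timestamp, and the URL of the archived resource. A minimal sketch of splitting a URL along those lines follows. The names `WaybackParts` and `split_wayback_url` are hypothetical, and the pattern only assumes the path-style layout shown on that slide; access points that use a different layout (for example, a query-string form) are not handled and raise `ValueError`.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class WaybackParts(NamedTuple):
    access_point: str               # Wayback URL plus collection, if any
    timestamp: str                  # 14 digits, or "*" for a date wildcard
    captured_at: Optional[datetime]  # parsed timestamp; None for "*"
    target: str                     # URL of the archived resource


# Access point, then "/<timestamp or *>/", then the archived http(s) URL.
_WAYBACK_PATTERN = re.compile(
    r"^(?P<access>.+?)/(?P<ts>\d{14}|\*)/(?P<target>https?://.+)$"
)


def split_wayback_url(url: str) -> WaybackParts:
    """Split a path-style Wayback access URL into its components."""
    match = _WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a path-style Wayback URL: {url!r}")
    ts = match.group("ts")
    captured_at = None if ts == "*" else datetime.strptime(ts, "%Y%m%d%H%M%S")
    return WaybackParts(
        access_point=match.group("access") + "/",
        timestamp=ts,
        captured_at=captured_at,
        target=match.group("target"),
    )


# Usage with the example URL from the slide:
parts = split_wayback_url(
    "http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html"
)
# parts.access_point is "http://webarchiveqr.loc.gov/loc_sites/"
# parts.target is "http://www.loc.gov/index.html"
```

Recovering `target` this way is one route to the live site URL behind a snapshot, which can then be tried against a different archive's access point.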
