
Intro to Web Archiving

Presented by Michele C. Weigle, June 26, 2018
ODU CS Machine Learning and Data Science Camp

Published in: Technology

  1. Intro to Web Archiving. Dr. Michele C. Weigle, @weiglemc. Web Science and Digital Libraries (WS-DL) Group, @WebSciDL, Department of Computer Science, Old Dominion University. June 26, 2018, ODU Machine Learning and Data Sciences Camp
  2. ODU WS-DL Group
       • Web Science and Digital Libraries
           – digital preservation
           – web archiving
           – web science (social media analysis, web usage analysis)
       • Our recent work has been featured in the popular press
     @WebSciDL, http://ws-dl.cs.odu.edu/, http://ws-dl.blogspot.com/
  3. ODU WS-DL Group
       • PhD Students: Scott Ainsworth, Sawood Alam, Lulwah Alkwai, Mohamed Aturban, Brian Griffin, Hussam Hallak, Shawn Jones, Mat Kelly, Corren McCoy, Louis Nguyen, Alexander Nwala
       • MS Students: Nauman Siddique, Miranda Smith
       • Faculty: Dr. Sampath Jayarathna and Dr. Jian Wu (coming in Fall 2018), Dr. Michael L. Nelson, Dr. Michele C. Weigle
  4. What is the past web?
  5. The Web holds our stories
  6. But webpages can disappear
       • Average lifespan of a webpage: 50-100 days
       • A year after publication, about 11% of content shared on social media will be gone.
     SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  7. Maybe it's archived? https://archive.org/web
  8. Why archives matter
       • Malaysia Airlines Flight 17 (MH17)
       • Ukrainian separatists originally took credit for downing a transport plane in that location
       • Later deleted the post
       • Internet Archive had archived the post before deletion
     http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
  9. We can use archives to tell stories. https://www.nytimes.com/2016/11/17/insider/in-13-headlines-the-drama-of-election-night.html (similar to our Hurricane Katrina example: https://www.slideshare.net/phonedude/why-careaboutthepast)
  10. If something's gone from the live web, check a web archive
  11. Web archives to the rescue! https://twitter.com/brian3354/status/966081774194511874
  12. Internet Archive's Wayback Machine has gone mainstream
       • "God bless you Internet Archive" - Rachel Maddow, Dec 12, 2016
       • Last Week Tonight, Mar 18, 2018
       • Jill Lepore, "The Cobweb", The New Yorker, Jan 26, 2015
  13. But Wayback is not Google
       • Wayback Machine has no full-text search
           – too big to be indexed
           – 654 billion web pages, 9 petabytes of data
           – growing at 20 TB/week
       • Enter URL and pick a date
     "It’s more like a phone book than like an archive." - Jill Lepore, The New Yorker
  14. What do people think the Wayback Machine is? https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
  15. What do people think the Wayback Machine is? https://www.cnn.com/2018/02/16/politics/richard-pinedo-guilty-plea/index.html https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213 https://web.archive.org/web/20180115103952/https:/auctionessistance.com/
  16. Caches are not archives. http://ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html http://www.wired.co.uk/article/russia-propaganda-online-blog-longform-medium-posts https://webcache.googleusercontent.com/search?q=cache:qwqnGPqC2vsJ:https://medium.com/%40TheFoundingSon/huffington-post-vs-whiteness-and-white-women-1e67193085d4+&cd=15&hl=en&ct=clnk&gl=uk
  17. Is it really that important to archive instead of just taking a screenshot? https://twitter.com/AngryBlackLady/status/990032514080108544 https://twitter.com/phonedude_mln/status/990070331737100288
  18. We should be doing both. https://twitter.com/conspirator0/status/1000475042017366017
  19. “If you see something, save something” https://blog.archive.org/2017/01/25/see-something-save-something/
  20. There's more than just the Internet Archive. http://timetravel.mementoweb.org/list/20020908180610/http://blog.reidreport.com/
  21. TimeTravel: http://timetravel.mementoweb.org
  22. Pro tip: submit pages to multiple archives. https://twitter.com/phonedude_mln/status/998948823845261312
  23. We've built tools to help people submit webpages to multiple archives
       • Mink – Google Chrome extension
       • #icanhazmemento – Twitter bot
       • ArchiveNow – Python module, Docker container, local web service
  24. Mink
       • Google Chrome extension
       • Submit currently viewed webpage to public archives
       • https://github.com/machawk1/Mink
     Mat Kelly, Michael L. Nelson and Michele C. Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," JCDL 2014, poster. http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
     “Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher”, 2014-2017, HK-50181-14
  25. #icanhazmemento
       • Twitter bot
       • Include #icanhazmemento in a tweet with a URL
       • Bot replies with a link to the memento of the page closest to the time of the tweet
       • If page not archived, bot submits URL to multiple public archives, replies with a link to the memento in Time Travel
       • https://github.com/anwala/icanhazmemento
     Alexander Nwala, "2015-07-22: I Can Haz Memento," http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html
  26. ArchiveNow
       • Python module, Docker container
       • Submit URI to multiple archives
       • https://github.com/oduwsdl/archivenow
     Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson and Michele C. Weigle, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation," JCDL 2018, poster. http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
     “Towards a Web-Centric Approach for Capturing the Scholarly Record”, 2016-2019
  27. Memento: Time Travel for the Web
     Access mementos in multiple web archives
     Memento’s core components:
       • A bridge between present and past: link and content negotiation
       • A bridge between past and present: link
  28. Memento Aggregator
  29. Memento Aggregator
  30. How can I use Memento?
       • Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
       • Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
       • http://timetravel.mementoweb.org
  31. Use Mink to view the odu.edu of the past
  32. Click the Mink icon
  33. Then choose your datetime
  34. Archived odu.edu
  35. Fixing 404 Pages: Google Results Page
  36. Fixing 404 Pages: Result Page. http://www.clashmusic.com/news/johnny-marr-leaves-the-cribs
  37. Fixing 404 Pages: Scrolling Down
  38. Fixing 404 Pages: Server Up, Page 404
  39. Fixing 404 Pages: Using Mink
  40. Fixing 404 Pages: Archived Page 2011-04-16
  41. #whatdiditlooklike
       • Twitter bot
       • Include #whatdiditlooklike in a tweet with a URL
       • Bot generates animated GIF of first memento of each year
       • Bot replies with a link to entry in Tumblr
       • Tumblr: http://whatdiditlooklike.mementoweb.org/
       • Source: https://github.com/anwala/wdill
     Alexander Nwala, "2015-02-05: What Did It Look Like?," http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
  42. Use web archives to save the current web and view the past web
       • Web Science and Digital Libraries (WS-DL) group at ODU
           – ws-dl.blogspot.com, @WebSciDL (Twitter)
       • Websites/Tools for web archiving
           – Internet Archive's Wayback Machine - archive.org/web
           – On-demand archiving - archive.is
           – Memento Time Travel - timetravel.mementoweb.org
           – Mink - matkelly.com/mink/
           – #icanhazmemento
           – #whatdiditlooklike
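
Slide 13 describes Wayback lookup as "Enter URL and pick a date", and the archived-page addresses on slides 15 and 20 both embed a 14-digit YYYYMMDDhhmmss stamp followed by the original URL. The sketch below builds addresses of that shape. The helper names are hypothetical, the two prefixes are copied from the URLs on those slides, and treating a naive datetime as UTC is an assumption of this example rather than something the slides state.

```python
from datetime import datetime, timezone

# Prefixes copied from the URLs shown on slides 15 and 20.
WAYBACK_PREFIX = "https://web.archive.org/web"
TIMETRAVEL_LIST_PREFIX = "http://timetravel.mementoweb.org/list"


def timestamp14(when: datetime) -> str:
    """Format a datetime as a 14-digit YYYYMMDDhhmmss stamp.

    Aware datetimes are converted to UTC first; naive ones are
    assumed (for this sketch) to already be in UTC.
    """
    if when.tzinfo is not None:
        when = when.astimezone(timezone.utc)
    return when.strftime("%Y%m%d%H%M%S")


def wayback_uri(url: str, when: datetime) -> str:
    """Address of a Wayback Machine capture near `when` (hypothetical helper)."""
    return f"{WAYBACK_PREFIX}/{timestamp14(when)}/{url}"


def timetravel_list_uri(url: str, when: datetime) -> str:
    """Address of the Time Travel listing across archives (hypothetical helper)."""
    return f"{TIMETRAVEL_LIST_PREFIX}/{timestamp14(when)}/{url}"


# Rebuilds the address shown on slide 20.
when = datetime(2002, 9, 8, 18, 6, 10, tzinfo=timezone.utc)
print(timetravel_list_uri("http://blog.reidreport.com/", when))
```

An archive may not hold a capture at exactly the requested second, so the stamp is best read as a target datetime rather than an exact key.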

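Slide 27 names two Memento building blocks: content negotiation from the present to the past, and links from the past back to the present. In the Memento protocol (RFC 7089), a client asks a TimeGate for a page as of a given datetime by sending an `Accept-Datetime` header, and responses carry `Link` headers with relations such as `original` and `memento`. The following is a minimal offline sketch of both pieces. The TimeGate prefix is an assumption (it is not shown on the slides), the function names are hypothetical, `archive.example` is a placeholder host, and the Link parser is deliberately simplified to double-quoted parameters.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: TimeGate prefix of the Time Travel aggregator (not on the slides).
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(url, when):
    """Return (TimeGate URI, headers) for datetime negotiation.

    `when` must be timezone-aware; it is converted to GMT for the
    Accept-Datetime header. Nothing is sent over the network here.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + url, {"Accept-Datetime": http_date}


# Simplified Link header grammar: <target>; name="value"; name="value", ...
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
_PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def memento_links(link_header):
    """List target URIs whose rel value includes "memento"."""
    found = []
    for target, params in _LINK.findall(link_header):
        rel = dict(_PARAM.findall(params)).get("rel", "")
        if "memento" in rel.split():
            found.append(target)
    return found


# Hypothetical Link header; archive.example is a placeholder host.
SAMPLE_LINK = (
    '<http://www.odu.edu/>; rel="original", '
    '<http://archive.example/20110416000000/http://www.odu.edu/>; '
    'rel="first memento"; datetime="Sat, 16 Apr 2011 00:00:00 GMT"'
)

uri, headers = timegate_request(
    "http://www.odu.edu/", datetime(2011, 4, 16, tzinfo=timezone.utc)
)
print(uri)
print(headers["Accept-Datetime"])
print(memento_links(SAMPLE_LINK))
```

To try this against a live service, the returned URI and headers could be passed to an HTTP client such as `urllib.request`; the TimeGate would normally answer with a redirect to the closest memento.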