Intro to Web Archiving
Dr. Michele C. Weigle, @weiglemc
Web Science and Digital Libraries (WS-DL) Group, @WebSciDL
Department of Computer Science
Old Dominion University
June 26, 2018
ODU Machine Learning and Data Sciences Camp
@weiglemc, @WebSciDL
ODU WS-DL Group
• Web Science and Digital Libraries
– digital preservation
– web archiving
– web science (social media analysis, web usage analysis)
• Our recent work has been featured in the popular press
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/
ODU WS-DL Group
PhD Students
• Scott Ainsworth
• Sawood Alam
• Lulwah Alkwai
• Mohamed Aturban
• Brian Griffin
• Hussam Hallak
• Shawn Jones
• Mat Kelly
• Corren McCoy
• Louis Nguyen
• Alexander Nwala
MS Students
• Nauman Siddique
• Miranda Smith
Faculty
• Dr. Michael L. Nelson
• Dr. Michele C. Weigle
• Dr. Sampath Jayarathna (coming in Fall 2018)
• Dr. Jian Wu (coming in Fall 2018)
What is the past web?
The Web holds our stories
But webpages can disappear
• Average lifespan of a webpage: 50-100 days
• A year after publication, about 11% of content shared on social media will be gone.
SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Maybe it's archived?
https://archive.org/web
Why archives matter
• Malaysia Airlines Flight 17 (MH17)
• Ukrainian separatists originally took credit for downing a transport plane in that location
• Later deleted the post
• Internet Archive had archived the post before deletion
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
We can use archives to tell stories
Similar to our Hurricane Katrina example: https://www.slideshare.net/phonedude/why-careaboutthepast
https://www.nytimes.com/2016/11/17/insider/in-13-headlines-the-drama-of-election-night.html
If something's gone from the live web, check a web archive
Web archives to the rescue!
https://twitter.com/brian3354/status/966081774194511874
Internet Archive's Wayback Machine has gone mainstream
"God bless you Internet Archive" - Rachel Maddow, Dec 12, 2016
Last Week Tonight, Mar 18, 2018
Jill Lepore, "The Cobweb", The New Yorker, Jan 26, 2015
But Wayback is not Google
• Wayback Machine has no full-text search
– too big to be indexed
– 654 billion web pages, 9 petabytes of data
– growing at 20 TB/week
• Enter URL and pick a date
"It’s more like a phone book than like an archive."
-Jill Lepore, The New Yorker
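A rough sketch of what "enter URL and pick a date" amounts to: the helper below builds a Wayback Machine address from a page URL and a datetime. The `web.archive.org/web/<14-digit timestamp>/<URL>` layout is inferred from the archived link that appears elsewhere in these slides. `wayback_url` is a hypothetical helper name for illustration, not part of any Internet Archive library.

```python
from datetime import datetime


def wayback_url(url, when):
    """Build a Wayback Machine address for `url` near the datetime `when`.

    Assumption: the address layout is
    https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    print(wayback_url("https://example.com/", datetime(2018, 1, 15, 10, 39, 52)))
```

The Wayback Machine typically serves the capture nearest to the requested timestamp, so the chosen date does not need to match a capture exactly.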
What do people think the Wayback Machine is?
https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
What do people think the Wayback Machine is?
https://www.cnn.com/2018/02/16/politics/richard-pinedo-guilty-plea/index.html
https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
https://web.archive.org/web/20180115103952/https:/auctionessistance.com/
Caches are not archives
http://ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html
http://www.wired.co.uk/article/russia-propaganda-online-blog-longform-medium-posts
https://webcache.googleusercontent.com/search?q=cache:qwqnGPqC2vsJ:https://medium.com/%40TheFoundingSon/huffington-post-vs-whiteness-and-white-women-1e67193085d4+&cd=15&hl=en&ct=clnk&gl=uk
Is it really that important to archive instead of just taking a screenshot?
https://twitter.com/AngryBlackLady/status/990032514080108544
https://twitter.com/phonedude_mln/status/990070331737100288
We should be doing both
https://twitter.com/conspirator0/status/1000475042017366017
“If you see something, save something”
https://blog.archive.org/2017/01/25/see-something-save-something/
There's more than just the Internet Archive
http://timetravel.mementoweb.org/list/20020908180610/http://blog.reidreport.com/
TimeTravel
http://timetravel.mementoweb.org
Pro tip: submit pages to multiple archives
https://twitter.com/phonedude_mln/status/998948823845261312
We've built tools to help people submit webpages to multiple archives
• Mink – Google Chrome extension
• #icanhazmemento – Twitter bot
• ArchiveNow – Python module, Docker container, local web service
Mink
Google Chrome extension
Submit currently viewed webpage to public archives
https://github.com/machawk1/Mink
Mat Kelly, Michael L. Nelson and Michele C. Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," JCDL 2014, poster.
http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
“Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher”, 2014-2017, HK-50181-14
#icanhazmemento
Twitter bot
Include #icanhazmemento in a tweet with a URL
Bot replies with a link to the memento of the page closest to the time of the tweet
If page not archived, bot submits URL to multiple public archives, replies with a link to the memento in Time Travel
Alexander Nwala, "2015-07-22: I Can Haz Memento," http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html
https://github.com/anwala/icanhazmemento
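The "memento closest to the time of the tweet" step can be sketched as a nearest-datetime lookup. This is a minimal illustration, not the bot's actual code: `closest_memento` and the `(datetime, uri)` pair layout are assumptions made for the example.

```python
from datetime import datetime


def closest_memento(mementos, target):
    """Return the (captured_at, uri) pair nearest in time to `target`.

    `mementos` is a list of (datetime, uri) pairs, a layout assumed
    for this sketch. Returns None when the page has no mementos,
    which is the case where the bot would submit the URL for archiving.
    """
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - target))


if __name__ == "__main__":
    captures = [
        (datetime(2014, 1, 1), "memento-a"),
        (datetime(2015, 7, 20), "memento-b"),
        (datetime(2016, 1, 1), "memento-c"),
    ]
    tweet_time = datetime(2015, 7, 22)
    print(closest_memento(captures, tweet_time))
```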
ArchiveNow
Python module, Docker container
Submit URI to multiple archives
https://github.com/oduwsdl/archivenow
Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson and Michele C. Weigle, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation," JCDL 2018, poster.
http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
“Towards a Web-Centric Approach for Capturing the Scholarly Record”, 2016-2019
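To illustrate the idea of sending one URI to several archives, here is a small sketch that expands a URI into one submission address per archive. It does not use ArchiveNow's API. The Internet Archive entry assumes the `web.archive.org/save/<URL>` form of its public save endpoint, and `archive.example` is a made-up placeholder, not a real archive.

```python
from urllib.parse import quote

# Illustrative templates only (assumptions, see the note above):
# - "internet_archive" assumes the web.archive.org/save/<URL> form
# - "example_archive" is a hypothetical placeholder host
SUBMIT_TEMPLATES = {
    "internet_archive": "https://web.archive.org/save/{url}",
    "example_archive": "https://archive.example/submit?url={quoted}",
}


def submission_urls(uri, templates=SUBMIT_TEMPLATES):
    """Map each archive name to the address that would request a capture of `uri`."""
    quoted = quote(uri, safe="")  # percent-encode for use as a query value
    return {
        name: template.format(url=uri, quoted=quoted)
        for name, template in templates.items()
    }


if __name__ == "__main__":
    for name, address in submission_urls("http://example.com/").items():
        print(name, address)
```

Keeping the archives in a dictionary means another archive can be added without changing the function, which is the spirit of a multi-archive tool.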
Memento: Time Travel for the Web
Access mementos in multiple web archives
Memento’s core components:
• A bridge between present and past: link and content negotiation
• A bridge between past and present: link
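The content negotiation mentioned above happens in the datetime dimension: in the Memento protocol (RFC 7089), a client asks for a past version by sending an `Accept-Datetime` request header. A minimal sketch of building that header follows. `accept_datetime_headers` is a hypothetical helper name, and the code only formats the header, it does not contact any server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Build the request header for Memento datetime negotiation.

    `when` must be timezone-aware; it is converted to UTC and written
    in the HTTP date format, e.g. "Thu, 31 May 2007 20:35:00 GMT".
    """
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2007, 5, 31, 20, 35, 0, tzinfo=timezone.utc)
    print(accept_datetime_headers(wanted))
```

A client would send this header with a request for the original URL's TimeGate, which then points to the memento closest to the requested datetime.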
Memento Aggregator
How can I use Memento?
Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
Time Travel: http://timetravel.mementoweb.org
Use Mink to view the odu.edu of the past
Click the Mink icon
Then choose your datetime
Archived odu.edu
Fixing 404 Pages: Google Results Page
Fixing 404 Pages: Result Page
http://www.clashmusic.com/news/johnny-marr-leaves-the-cribs
Fixing 404 Pages: Scrolling Down
Fixing 404 Pages: Server Up, Page 404
Fixing 404 Pages: Using Mink
Fixing 404 Pages: Archived Page 2011-04-16
#whatdiditlooklike
Twitter bot
Include #whatdiditlooklike in a tweet with a URL
Bot generates animated GIF of first memento of each year
Bot replies with a link to entry in Tumblr
Tumblr: http://whatdiditlooklike.mementoweb.org/
Source: https://github.com/anwala/wdill
Alexander Nwala, "2015-02-05: What Did It Look Like?," http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
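Picking the "first memento of each year" can be sketched as a single pass over the captures in time order. This is an illustration under assumed data shapes, not the bot's actual code: `first_memento_per_year` and the `(datetime, uri)` pairs are hypothetical.

```python
from datetime import datetime


def first_memento_per_year(mementos):
    """Return the earliest (captured_at, uri) pair for each calendar year.

    `mementos` is a list of (datetime, uri) pairs in any order; the
    result is ordered by year, ready to be turned into GIF frames.
    """
    first = {}
    for captured_at, uri in sorted(mementos):
        # setdefault keeps the first (earliest) capture seen for a year
        first.setdefault(captured_at.year, (captured_at, uri))
    return [first[year] for year in sorted(first)]


if __name__ == "__main__":
    captures = [
        (datetime(2011, 4, 16), "capture-3"),
        (datetime(2010, 6, 1), "capture-2"),
        (datetime(2010, 2, 1), "capture-1"),
        (datetime(2011, 9, 9), "capture-4"),
    ]
    for frame in first_memento_per_year(captures):
        print(frame)
```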
Use web archives to save the current web and view the past web
• Web Science and Digital Libraries (WS-DL) group at ODU
– ws-dl.blogspot.com, @WebSciDL (Twitter)
• Websites/Tools for web archiving
– Internet Archive's Wayback Machine - archive.org/web
– On-demand archiving - archive.is
– Memento Time Travel - timetravel.mementoweb.org
– Mink - matkelly.com/mink/
– #icanhazmemento
– #whatdiditlooklike
