
Tools for Managing the Past Web
ODU - ECE Seminar
February 20, 2015
Presented by Michele Weigle

  1. Tools for Managing the Past Web. Dr. Michele C. Weigle, Web Sciences and Digital Libraries (WS-DL) Group, Department of Computer Science, Old Dominion University. ODU - ECE Seminar, February 20, 2015
  2. What is the past web?
  3. Why should I care about the past web and web archives?
  4. The Web holds our stories
  5. But webpages can disappear • Average lifespan of a webpage: 50-100 days • A year after publication, about 11% of content shared on social media will be gone. SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012, http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  6. Maybe it's archived? archive.org/web
  7. Why archives matter • Malaysia Airlines Flight 17 (MH17) • Ukrainian separatists originally took credit for downing a transport plane in that location • Later deleted the post • Internet Archive had archived the post before deletion. http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
  8. Web archiving in the news - 2015. http://www.newyorker.com/magazine/2015/01/26/cobweb
  9. But Wayback is not Google • Wayback Machine has no full-text search – too big to be indexed – 452 billion web pages, 9 petabytes of data – growing at 20 TB/week • Enter URL and pick a date. "It’s more like a phone book than like an archive." - Jill Lepore, The New Yorker
  10. The Internet Archive isn't the only archive in town (chart axis: # of archived pages)
  11. How can I access the archives? • MementoFox: http://ws-dl.blogspot.com/2010/03/2010-03-19-mementofox-add-on-released.html • Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html • Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html • http://www.mementoweb.org
  12. TimeTravel. http://timetravel.mementoweb.org
  13. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  14. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  15. The State of Web Archiving • current: "Hooray! It's in the archive!" vs. • future: "How well was it archived?"
  16. Damaged Memento
  17. How damaged are these mementos? • M = 0.17 (live web). Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
  18. How damaged are these mementos? • M = 0.17 (live web) • M = 0.24 (missing main). Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
  19. How damaged are these mementos? • M = 0.17 (live web) • M = 0.24 (missing main) • M = 0.29 (missing logo + navigation). Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
  20. How damaged are these mementos? • M = 0.17, D = 0.09 (live web) • M = 0.24, D = 0.41 (missing main) • M = 0.29, D = 0.36 (missing logo + navigation). Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
  21. How to detect damage? vs. Brunelle et al., JCDL 2014
  22. Good News: Although M is steady/increasing, D is decreasing • M = percentage missing • D = our damage metric • Sampled 45,000 mementos - one memento/year of ~1850 webpages - webpages from Bitly URIs shared over Twitter and Archive-It collections. Brunelle et al., JCDL 2014
  23. Using JavaScript can result in damaged mementos • JavaScript is responsible for an increasing proportion of missing embedded resources over time. Brunelle, Kelly, Weigle, and Nelson, "The Impact of JavaScript on Archivability", International Journal of Digital Libraries (IJDL), 2015
  24. Sometimes the live web "leaks" into the archive • Sept 3, 2008 • 2012. http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
  25. Different parts of a page can be crawled at different times. Ainsworth and Nelson, "Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive", JCDL 2013
  26. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  27. Which page did Chris Hayes mean to tweet? • Tweet on Oct 3, 2014 • Likely target (captured Oct 1, 2014)
  28. What you see depends on when you click • Oct 9, 2014 • Oct 10, 2014 • Nov 19-Dec 15, 2014 • Today (Feb 2015) – now fergusonaction.com
  29. Mapping Tweet Relevance. SalahEldeen and Nelson, "Reading the Correct History? Modeling Temporal Intention in Resource Sharing", JCDL 2013
  30. Let the reader choose live or archived
  31. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  32. Browsing TimeMaps • How were these 4 thumbnails chosen?
  33. What did usps.com look like? http://whatdiditlooklike.mementoweb.org/ • Animated GIF • 1st memento of each year • Submit a URL via Twitter: “#whatdiditlooklike URL”
  34. Which tells you more about the past of www.apple.com? • 700 thumbnails (not even all of them!) • 32 sampled thumbnails. AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
  35. TimeMap Thumbnail Summaries • Compare HTML, not images • Compute SimHash of HTML – result is a string representing the content of the page • Calculate Hamming distance between SimHashes of consecutive mementos • Generate thumbnails of mementos that have at least a 4 character difference in SimHash – threshold too low -> near duplicate images – threshold too high -> miss important changes (figure label: 3 lines of difference). AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
  36. Grid View
  37. Cover Flow View
  38. Embed in Wayback
  39. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  40. Archive What I See Now • Humanities researchers know they should archive web resources • Standard web archiving tools are difficult for non-IT experts. "Archive What I See Now", NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014
  41. Why not just take a screenshot or “save as”? • Can't interact with a screenshot • "Save Page As..." output is difficult to keep organized -- especially with multiple captures over time
  42. What about archiving pages behind authentication or that change quickly? • Facebook - requires login • Twitter - changes faster than typical crawling rate
  43. How we're addressing the problem: WARCreate • Google Chrome extension • Archive the current state of the page in standard Web Archive (WARC) format • Compatible with Wayback. Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012. Kelly, Weigle, and Nelson, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", Digital Preservation 2012, Tools Demo Session
  44. WARCreate - Work in Progress • New modes of operation – record mode • while activated, add capture of each page visited to the WARC – countdown mode • every interval, refresh and add new capture of page – event mode • add new capture of page every time it dynamically reloads or refreshes
  45. What to do with created WARCs? WAIL • Load created WARCs into a Wayback instance on your local computer • Single-click install of Wayback (and other archiving tools) • Available for Windows, OS X. Kelly, Weigle, and Nelson, "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving", Personal Digital Archiving 2013, Poster Session. Kelly, Nelson, and Weigle, "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy", Digital Preservation 2013
  46. Bridging the gap between the past web and the live web: Mink • Google Chrome extension • For each page you visit, displays the number of archived versions available • Provides access by date • Allows for submission to public archiving services. Kelly, Nelson, and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento", poster, ACM/IEEE Digital Libraries (DL), September 2014
  47. Tools • WARCreate • Mink • WAIL. https://ws-dl.cs.odu.edu/Software
  48. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  49. Storify. https://storify.com/nzherald/mu
  50. Bookmarking is not preserving
  51. Bookmarking is not preserving
  52. Archive-It Collections. https://archive-it.org/collections/2358
  53. Storytelling For Archives • Archived collections • Storytelling services • Archived enriched stories. AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013
  54. Tools for Storytelling • Tools for Users – use existing tools like Storify to view the stories of a collection • Tools for Curators – use existing stories to augment your collections – create stories from your collections • candidate mementos automatically selected
  55. Story Types (grid axes: Time same/different, URI same/different) • Fixed Page – Fixed Time: differences in GeoIP, mobile, etc. • Fixed Page – Sliding Time: evolution of a single page (or domain) through time • Sliding Page – Fixed Time: different perspectives on a point in time • Sliding Page – Sliding Time: broadest possible coverage of a collection • Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
  56. ODU WS-DL Projects: Tools for Managing the Past Web • Archive Quality • Tweet Intention • TimeMap Summaries • Archive What I See Now • Storytelling for Archives
  57. Web Sciences and Digital Libraries Group (WS-DL) Faculty • Dr. Michael L. Nelson • Dr. Michele C. Weigle PhD Students • Scott Ainsworth • Sawood Alam • Lulwah Alkwai • Yasmin AlNoamany • Mohamed Aturban • Justin Brunelle • Mat Kelly • Corren McCoy • Shawn Jones • Amara Naas • Louis Nguyen • Alexander Nwala • Hany SalahEldeen. @WebSciDL http://ws-dl.cs.odu.edu/ http://ws-dl.blogspot.com/ Dr. Michele C. Weigle mweigle@cs.odu.edu @weiglemc http://www.cs.odu.edu/~mweigle/
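
The "TimeMap Thumbnail Summaries" slide describes a selection rule: fingerprint each memento's HTML with SimHash, take the Hamming distance between the fingerprints of consecutive mementos, and generate a thumbnail only when the fingerprints differ by at least 4 characters. The Python sketch below illustrates that rule under stated assumptions; it is not the authors' implementation. The token-level SimHash built on MD5, the 64-bit fingerprint rendered as 16 hex characters, the choice to always keep the first memento, and the names `simhash`, `char_distance`, and `select_mementos` are all hypothetical choices made for illustration.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Return a hex-string SimHash fingerprint of `text`.

    Assumption: unweighted word tokens hashed with MD5; the slides do not
    say which tokenization or hash function was used.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, w in enumerate(weights):
        if w > 0:
            value |= 1 << i
    return format(value, "0{}x".format(bits // 4))


def char_distance(a, b):
    """Hamming distance counted in characters of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_mementos(html_versions, threshold=4):
    """Return indices of mementos that would get a thumbnail.

    A memento is selected when its fingerprint differs from the fingerprint
    of the memento immediately before it by at least `threshold` characters.
    Assumption: the first memento is always selected.
    """
    selected = []
    previous = None
    for index, html in enumerate(html_versions):
        fingerprint = simhash(html)
        if previous is None or char_distance(previous, fingerprint) >= threshold:
            selected.append(index)
        previous = fingerprint
    return selected
```

As the slide notes, the threshold is a trade-off: a lower value keeps near-duplicate thumbnails, while a higher value risks skipping important changes. Calling `select_mementos(list_of_html_strings, threshold=4)` returns the positions in the TimeMap that would be rendered.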