Tools for Managing the Past Web

2014 Archive-It Partners Meeting
November 18, 2014
Presented by Michele Weigle

Published in: Technology

  1. Tools for Managing the Past Web
     Michele C. Weigle
     Web Science and Digital Libraries (WS-DL) Group, Department of Computer Science, Old Dominion University, Norfolk, VA
     Includes joint work with Michael L. Nelson and our PhD students: Yasmin AlNoamany, Ahmed AlSum (PhD 2014), Justin Brunelle, Mat Kelly, and Hany SalahEldeen
     Archive-It Partners Meeting, November 18, 2014
  2. Outline
     Start-Up and Implementation Grants
       – WARCreate
       – WAIL
       – Mink
       – Assessing Memento Damage
     Web Archiving Incentive
       – Thumbnail Summarization
       – Detecting Off-Topic Mementos
     https://ws-dl.cs.odu.edu/Software
  3. Archive What I See Now
     • Standard web archiving tools are difficult for non-IT experts.
     • "Save Page As" is not suitable for archiving purposes.
     • Pages are behind authentication.
     • Pages change quickly, but the current state needs archiving.
     NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014
  4. How we're addressing the problem: WARCreate
     • Google Chrome extension
     • Archive the current state of the page in standard Web Archive (WARC) format
     • Compatible with Wayback
     Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," JCDL 2012
     Kelly, Weigle, and Nelson, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session
  5. WARCreate - Work in Progress
     • New modes of operation
       – record mode
         • while activated, add capture of each page visited to the WARC
       – countdown mode
         • every interval, refresh and add new capture of page
       – event mode
         • add new capture of page every time it dynamically reloads or refreshes
  6. WARCreate - Work in Progress
     • Uploading created WARCs to Archive-It or other archives
       – consideration of data integrity
       – merging local WARCs with crawled WARCs
         • how do we account for your www.facebook.com vs. my www.facebook.com?
       – privacy
  7. What to do with created WARCs? WAIL
     • Load created WARCs into a Wayback instance on your local computer
     • Single-click install of Wayback (and other archiving tools)
     • Includes IIPC's OpenWayback 2.0 and Heritrix 3.2
     • Available for Windows, OS X (Linux coming soon!)
     Kelly, Weigle, and Nelson, "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session
     Kelly, Nelson, and Weigle, "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013
  8. WAIL - Work in Progress
     • More tools
       – integration with Ilya Kreymer's pywb
     • User interface enhancements
       – ease of installation
       – intuitive GUI
       – configuration of Wayback display and Heritrix crawls
  9. Bridging the gap between the past web and the live web: Mink
     • Google Chrome extension
     • For each page you visit, displays the number of archived versions available
     • Provides access by date
     • Allows for submission to public archiving services
     Kelly, Nelson, and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.
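A count of archived versions like the one Mink displays can be derived from a Memento TimeMap. The following is a minimal sketch, not Mink's own code (Mink is a Chrome extension): it assumes the TimeMap is already available as a link-format string, and the names `parse_mementos` and `SAMPLE_TIMEMAP`, as well as the `archive.example` URIs, are made up for illustration.

```python
import re
from datetime import datetime

# One link-format entry looks like: <uri>; param="value"; param="value"
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (uri, datetime) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            # HTTP-date format; %a and %b assume an English (C) locale.
            when = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((uri, when))
    return sorted(mementos, key=lambda pair: pair[1])


# Hypothetical TimeMap for illustration only.
SAMPLE_TIMEMAP = '''
<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20120521000000/http://example.com/>; rel="first memento"; datetime="Mon, 21 May 2012 00:00:00 GMT",
<http://archive.example/20130516000000/http://example.com/>; rel="last memento"; datetime="Thu, 16 May 2013 00:00:00 GMT"
'''

mementos = parse_mementos(SAMPLE_TIMEMAP)
print(len(mementos), "archived versions")
for uri, when in mementos:
    print(when.date(), uri)
```

Sorting by datetime is what would make "access by date" straightforward: the list can be grouped by year or month before being shown to the user.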
  10. Mink - Work in Progress
      • Pick public archives (Memento Aggregator) or private archive (local computer)
  11. Tools: WARCreate, Mink, WAIL
      https://ws-dl.cs.odu.edu/Software
  12. Outline
      Start-Up and Implementation Grants
        – WARCreate
        – WAIL
        – Mink
        – Assessing Memento Damage
      Web Archiving Incentive
        – Thumbnail Summarization
        – Detecting Off-Topic Mementos
  13. How damaged are these mementos?
      M = percentage missing, D = our damage metric
      • M = 0.17, D = 0.09 (live web)
      • M = 0.24, D = 0.41 (missing main)
      • M = 0.29, D = 0.36 (missing logo + navigation)
      Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources," IEEE/ACM Digital Libraries (DL) 2014, Best Student Paper
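The examples above show that M and D can disagree: a page can be missing few resources yet be badly damaged if the missing one matters. The sketch below illustrates that idea only. The slides do not define D, so `weighted_damage` is a hypothetical stand-in (share of importance weight lost), and the resource names and weights are invented.

```python
def percentage_missing(resources):
    """M: fraction of embedded resources that are missing."""
    missing = sum(1 for r in resources if r["missing"])
    return missing / len(resources)


def weighted_damage(resources):
    """Hypothetical stand-in for D: share of total importance weight that is missing.

    This is NOT the authors' metric, only an illustration of importance weighting.
    """
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total


# Invented example pages; weights are arbitrary importance scores.
page_a = [  # two unimportant resources missing
    {"name": "main-image", "weight": 10.0, "missing": False},
    {"name": "stylesheet", "weight": 5.0, "missing": False},
    {"name": "tracker-1", "weight": 0.5, "missing": True},
    {"name": "tracker-2", "weight": 0.5, "missing": True},
]
page_b = [  # one important resource missing
    {"name": "main-image", "weight": 10.0, "missing": True},
    {"name": "stylesheet", "weight": 5.0, "missing": False},
    {"name": "tracker-1", "weight": 0.5, "missing": False},
    {"name": "tracker-2", "weight": 0.5, "missing": False},
]

for label, page in (("A", page_a), ("B", page_b)):
    print(label, percentage_missing(page), weighted_damage(page))
```

Page B is missing fewer resources than page A but scores higher on the weighted measure, which mirrors the "missing main" example on the slide.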
  14. Good News: Although M is steady/increasing, D is decreasing
      M = percentage missing, D = our damage metric
      • Sampled 45,000 URI-Ms
        – one URI-M per year for each of ~1850 URI-Rs
        – URI-Rs from Bitly URIs shared over Twitter and Archive-It collections
  15. Outline
      Start-Up and Implementation Grants
        – WARCreate
        – WAIL
        – Mink
        – Assessing Memento Damage
      Web Archiving Incentive
        – Thumbnail Summarization
        – Detecting Off-Topic Mementos
  16. Browsing TimeMaps
      How were these 4 thumbnails chosen?
  17. Which tells you more about the past of www.apple.com?
      • 700 thumbnails (not even all of them!)
      • 32 sampled thumbnails
      AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives," ECIR 2014
  18. Thumbnail Summarization
      • Process
        – compare HTML of consecutive mementos
          • more efficient than image diff
        – when diff threshold passed, generate thumbnail
        – return data + thumbnail as JSON
      • Considerations
        – diff threshold too low -> near-duplicate images
        – diff threshold too high -> miss important changes
      • Work in Progress
        – Wayback plugin
        – embeddable version
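The process above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the diff measure (`difflib.SequenceMatcher`), the threshold value of 0.3, always selecting the first memento, and the function names are all choices made for this example. Actual thumbnail rendering is left out; the sketch only decides which mementos would get one.

```python
import json
from difflib import SequenceMatcher


def html_difference(a, b):
    """Return a difference score in [0, 1]; 0 means the two strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_for_thumbnails(mementos, threshold):
    """Pick mementos whose HTML differs from the previous memento by more than threshold.

    mementos: list of (date_string, html) pairs in chronological order.
    The first memento is always selected (an assumption of this sketch).
    """
    selected = []
    previous_html = None
    for when, html in mementos:
        if previous_html is None or html_difference(previous_html, html) > threshold:
            selected.append(when)
        previous_html = html
    return selected


# Invented sample data: a tiny edit, then a full redesign.
mementos = [
    ("2012-01-01", "<html><body><h1>Store</h1><p>Welcome</p></body></html>"),
    ("2012-02-01", "<html><body><h1>Store</h1><p>Welcome!</p></body></html>"),
    ("2012-03-01", "<html><body><div>Completely new layout with different "
                   "navigation, products, and footer text</div></body></html>"),
]

selected = select_for_thumbnails(mementos, threshold=0.3)
print(json.dumps({"thumbnails": selected}))
```

Lowering `threshold` toward 0 would also select the February capture (a near duplicate), while raising it toward 1 would skip the March redesign, which is the trade-off listed under Considerations.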
  19. Thumbnail Summary Screencast
  20. Outline
      Start-Up and Implementation Grants
        – WARCreate
        – WAIL
        – Mink
        – Assessing Memento Damage
      Web Archiving Incentive
        – Thumbnail Summarization
        – Detecting Off-Topic Mementos
  21. Have you ever had this problem?
      • May 21, 2012
      • May 16, 2013 (nothing but spam)
  22. Detecting Off-Topic Mementos
      • Goal: Build a tool to alert curators to potential off-topic mementos in a collection
      • Compare text of mementos
        – Intersection of top terms (TF)
        – Cosine similarity
        – Jaccard similarity coefficient
        – Clustering with topic modeling
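Three of the text comparisons listed above are simple enough to sketch directly. This is a minimal illustration, not the tool described in the talk: the tokenizer, the choice of the first memento as the reference, the 0.2 cutoff, and all function names are assumptions, and the sample texts are invented. Topic-model clustering is not covered.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tf_a, tf_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(tf_a), set(tf_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_term_overlap(tf_a, tf_b, k=5):
    """Number of terms shared by the k most frequent terms of each text."""
    top_a = {t for t, _ in tf_a.most_common(k)}
    top_b = {t for t, _ in tf_b.most_common(k)}
    return len(top_a & top_b)


def looks_off_topic(reference, candidate, threshold=0.2):
    """Flag a memento whose cosine similarity to the reference falls below threshold."""
    return cosine_similarity(reference, candidate) < threshold


# Invented memento texts for illustration.
first = term_frequencies("Occupy protest march downtown protest organizers schedule")
later_on_topic = term_frequencies("Occupy protest organizers announce new march schedule")
later_spam = term_frequencies("Buy cheap watches online best prices discount watches")

for label, tf in (("on-topic", later_on_topic), ("spam", later_spam)):
    print(label, round(cosine_similarity(first, tf), 3),
          round(jaccard_similarity(first, tf), 3), looks_off_topic(first, tf))
```

A fixed cutoff is the weakest part of this sketch; a usable value would likely vary by collection and would need to be tuned against labeled mementos.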
  23. Test Collections
  24. Turns out to be rather difficult
      • Egyptian Revolution
        – lots of non-English pages
      • Occupy Movement
        – lots of Facebook and social media pages
        – template extractors have trouble with these
      • Boston Marathon Bombing
      but we're making progress (stay tuned!)
  25. Storytelling For Archives
      [Figure labels: Storytelling services; Archived collections; Archived enriched stories]
      AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling," TCDL Bulletin, December 2013.
  26. Story Types
      • Fixed Page – Fixed Time: differences in GeoIP, mobile, etc.
      • Fixed Page – Sliding Time: evolution of a single page (or domain) through time
      • Sliding Page – Fixed Time: different perspectives on a point in time
      • Sliding Page – Sliding Time: broadest possible coverage of a collection
      [Grid axes: Time (same / different) by URI (same / different)]
      Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
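The four story types form a two-by-two grid over "same or different URI" and "same or different time". A small sketch of that classification follows; the function name `story_type` and the use of plain date strings (so "same time" means "same day") are assumptions made for this example.

```python
def story_type(mementos):
    """Classify a set of mementos into one of the four story types.

    mementos: iterable of (uri, date_string) pairs; must be non-empty.
    """
    mementos = list(mementos)
    uris = {uri for uri, _ in mementos}
    dates = {date for _, date in mementos}
    page = "Fixed Page" if len(uris) == 1 else "Sliding Page"
    time = "Fixed Time" if len(dates) == 1 else "Sliding Time"
    return f"{page} – {time}"


# Invented examples.
evolution = [("http://example.com/", "2012-05-21"), ("http://example.com/", "2013-05-16")]
perspectives = [("http://example.com/", "2013-04-15"), ("http://example.org/", "2013-04-15")]

print(story_type(evolution))
print(story_type(perspectives))
```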
  27. Tools for Storytelling
      • Tools for Curators
        – create stories from your collections
          • candidate mementos automatically selected
        – use existing stories to augment your collections
      • Tools for Users
        – use existing tools like Storify to view the stories of a collection
  28. Tools for Managing the Past Web
      Start-Up and Implementation Grants
        – WARCreate
        – WAIL
        – Mink
        – Assessing Memento Damage
      Web Archiving Incentive
        – Thumbnail Summarization
        – Detecting Off-Topic Mementos
      Web Science and Digital Libraries (WS-DL) Group: @WebSciDL, http://ws-dl.cs.odu.edu/, http://ws-dl.blogspot.com/
      Michele C. Weigle: mweigle@cs.odu.edu, @weiglemc, http://www.cs.odu.edu/~mweigle/
      https://ws-dl.cs.odu.edu/Software
