Focused Crawl of Web Archives to Build Event Collections

Presentation at WebSci 2018
https://doi.org/10.1145/3201064.3201085

  1. Focused Crawl of Web Archives to Build Event Collections. @mart1nkle1n, WebSci 2018, 05/30/2018, Amsterdam, NL. Martin Klein, Lyudmila Balakireva, Herbert Van de Sompel, Research Library, Los Alamos National Laboratory
  2. Background – Event Collection Building
     • Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
     • Potentially with guidance from institutional collection policy
     • Results in a list of seeds (URIs, social media accounts, etc.)
     • Utilization of crawling services such as Archive-It, Social Feed Manager
  3. Problems and our Approach
     • Temporal: time passed since event is of concern → Use of web archives
     • Selection: seeds often picked manually → Use of references from Wikipedia pages
     • Relevance: seed assessment often done by humans → Use of focused crawling with content and temporal relevance assessment
     Inspiration from: “Extracting Event-Centric Document Collections from Large-Scale Web Archives”, Gerhard Gossen, Elena Demidova, Thomas Risse, https://doi.org/10.1007/978-3-319-67008-9_10
  4. Background – Archived Web
     • Web archives are an invaluable resource for researchers, historians, journalists, etc.
     • Often broad in scope, large in scale, covering different temporal intervals
     • Makes discovery, access, and analysis difficult
  7. Questions
     • Can we create event collections by focused crawling online-available web archives?
     • How do event collections created from the archived web compare to those created from the live web?
     • How does the amount of time passed since the event affect the collections built from the live and the archived web?
     • How do event collections built from the archived web compare to manually curated collections?
  8. Experiment
     • Topics limited to terror attacks and mass shootings in the U.S.
     • From different times in the past
     • Focused crawl of:
       a) 22 archives, simultaneously, via Memento infrastructure
       b) the live web
     • Take content and temporal relevance into account, equally weighted
     • Use events’ Wikipedia page as input for focused crawler
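Crawling archives "via Memento infrastructure" relies on datetime negotiation as defined in RFC 7089: a client sends the original URI to a TimeGate together with an `Accept-Datetime` header and is pointed to the capture closest to that time. The sketch below only builds such a request and does not send it. The function name `timegate_request` is hypothetical, and the use of the public Time Travel aggregator endpoint is an assumption; the slides do not say which TimeGate the crawler used.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: the public Memento Time Travel aggregator TimeGate.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri, when):
    """Return (request URL, headers) for Memento datetime negotiation.

    `when` must be a timezone-aware datetime; it is converted to GMT,
    the form RFC 7089 requires for Accept-Datetime.
    """
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + uri, {"Accept-Datetime": stamp}


url, headers = timegate_request(
    "http://www.cnn.com/", datetime(2017, 11, 1, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

Sending this request with any HTTP client should yield a redirect to an archived capture near the requested datetime, drawn from whichever archives the aggregator covers.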
  9. Content Relevance
     1. Content of Wikipedia page + random 60% of page’s references
        • Generate topic vector (TF-IDF of 1-grams + 2-grams)
     2. Content of remaining 40% of Wikipedia page’s outlinks
        • Generate topic vector (TF-IDF of 1-grams + 2-grams)
     • Compute cosine similarity value between vectors 1 and 2
     • Run 10 times
     • Take average similarity value as content threshold
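The threshold procedure on this slide can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' code: all function names are hypothetical, whitespace tokenization stands in for real text processing, the IDF is smoothed and computed over just the two documents being compared, and the fixed random seed only makes the run repeatable.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Lowercased 1-grams and 2-grams of a text (whitespace tokenization)."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(tokens_a, tokens_b):
    """TF-IDF vectors for two token lists; IDF uses only these two documents."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    vec_a, vec_b = {}, {}
    for term in set(tf_a) | set(tf_b):
        df = (term in tf_a) + (term in tf_b)
        idf = math.log(3 / (1 + df)) + 1  # smoothed IDF (an assumption)
        vec_a[term] = tf_a[term] * idf
        vec_b[term] = tf_b[term] * idf
    return vec_a, vec_b


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(term, 0.0) for term, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(wiki_text, reference_texts, runs=10, share=0.6, seed=0):
    """Average similarity between vector 1 and vector 2 over several random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = reference_texts[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * share)
        # Vector 1: Wikipedia page plus a random 60% of the referenced pages.
        part_1 = [t for text in [wiki_text] + shuffled[:cut] for t in ngrams(text)]
        # Vector 2: the remaining 40% of the referenced pages.
        part_2 = [t for text in shuffled[cut:] for t in ngrams(text)]
        scores.append(cosine(*tfidf_vectors(part_1, part_2)))
    return sum(scores) / len(scores)
```

A crawled page would then count as content-relevant when its similarity to the event's topic vector reaches the value returned by `content_threshold`.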
  10. Temporal Relevance
      • Define temporal interval for which crawled pages are considered relevant
      • Event date extracted from Wikipedia event page
      • Change point determined from graph of proportional Wikipedia page edits per day
      [Figure: temporal relevance on a scale from 0 to 1 over time, with the event date, the change point, and today marked on the time axis]
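One way to turn the interval above into a score is sketched below. The slide only names the event date, the change point, and today, with relevance values of 1 and 0; the linear decay between the change point and today is an assumption made for illustration, as is the function name `temporal_relevance`.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point, today):
    """Score in [0, 1] for a page dated `page_date`.

    Assumed shape: 0 before the event, 1 from the event date up to the
    change point, then a linear decay that reaches 0 at `today`.
    """
    if page_date < event_date or page_date > today:
        return 0.0
    if page_date <= change_point:
        return 1.0
    return (today - page_date).days / (today - change_point).days


# Dates below are illustrative; only the event date comes from the slides.
event = date(2011, 1, 8)
change = date(2011, 2, 8)   # hypothetical change point
crawl_day = date(2017, 11, 30)
print(temporal_relevance(date(2011, 1, 20), event, change, crawl_day))
print(temporal_relevance(date(2014, 6, 1), event, change, crawl_day))
```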
  11. Datetime Extraction
      • Extract datetime from pages via:
        • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
        • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
        • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
        • Memento datetime
        • X-Header
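The first two extraction routes (date in the URI path, date in a meta tag) can be sketched with the standard library alone, using the two examples from the slide. The names `datetime_from_uri`, `datetime_from_meta`, and `MetaDateParser` are hypothetical, and the URI pattern only covers the `/YYYY/MM/DD/` layout shown here; real pages use many other layouts.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Matches a /YYYY/MM/DD/ path segment such as the CNN example on the slide.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def datetime_from_uri(uri):
    """Return the date embedded in the URI path, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. month 13
        return None


class MetaDateParser(HTMLParser):
    """Collects the first publication datetime found in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.found is not None:
            return
        attributes = dict(attrs)
        keys = {attributes.get("property"), attributes.get("itemprop")}
        if keys & {"article:published", "datePublished"}:
            try:
                self.found = datetime.fromisoformat(attributes.get("content") or "")
            except ValueError:
                pass  # unparseable content; keep looking


def datetime_from_meta(html):
    """Return the publication datetime from meta tags, or None."""
    parser = MetaDateParser()
    parser.feed(html)
    return parser.found


print(datetime_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/"))
print(datetime_from_meta(
    '<meta property="article:published" itemprop="datePublished" '
    'content="2017-12-09T10:14:50-05:00" />'
))
```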
  12. Crawls
      • Use version of Wikipedia page that was live at change point
      • Crawl stop conditions:
        • No more relevant documents left
        • 5 levels deep
      • Utilized crawl priority queue
      [Figure: crawl tree with the seed at level 0, its children at level 1, and their children at level 2]
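The stop conditions and the priority queue can be sketched as a best-first crawl over an abstract link graph. This is a simplified illustration, not the crawler used in the study: `focused_crawl`, `combined_score`, `get_links`, and `score` are hypothetical names, the seed is always expanded, and a child inherits its parent's score as queue priority because its own score is unknown until it is fetched. The equal weighting of content and temporal relevance follows the Experiment slide.

```python
import heapq


def combined_score(content, temporal):
    """Equal weighting of content and temporal relevance."""
    return 0.5 * content + 0.5 * temporal


def focused_crawl(seed, get_links, score, threshold, max_depth=5):
    """Best-first crawl that keeps and expands only relevant pages.

    get_links(uri) returns outlinks; score(uri) returns the page's
    relevance. Both are supplied by the caller (hypothetical interfaces).
    """
    queue = [(-1.0, 0, seed)]  # (negated priority, depth, uri)
    seen = {seed}
    relevant = []
    while queue:  # stop condition 1: nothing relevant left to expand
        _, depth, uri = heapq.heappop(queue)
        page_score = score(uri)
        if uri != seed and page_score < threshold:
            continue  # irrelevant pages are neither kept nor expanded
        relevant.append(uri)
        if depth == max_depth:  # stop condition 2: 5 levels deep
            continue
        for child in get_links(uri):
            if child not in seen:
                seen.add(child)
                heapq.heappush(queue, (-page_score, depth + 1, child))
    return relevant


# Toy link graph and scores, made up for illustration.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
scores = {"seed": 1.0, "a": 0.9, "b": 0.2, "c": 0.8, "d": 0.9}
print(focused_crawl("seed", lambda u: graph.get(u, []), scores.get, 0.5))
```

In the toy graph, page `d` is never visited because its only parent `b` falls below the threshold, which is the pruning behavior that makes the crawl focused.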
  13. Collections Crawled (in November 2017)
      • New York City, October 31st 2017
      • San Bernardino, December 2nd 2015
      • Tucson, January 8th 2011
      • Binghamton, April 3rd 2009
  14. NYC, 10/31/2017 – URIs per Level
      [Figure: two panels, Web Archive Crawl and Live Web Crawl. X-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100). Series: all URIs, relevant URIs]
  15. TUC, 01/08/2011 – URIs per Level
      [Figure: two panels, Web Archive Crawl and Live Web Crawl. X-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100). Series: all URIs, relevant URIs]
  16. NYC, 10/31/2017 – Relevance over…
      [Figure: two panels, Crawled Documents and Crawl Time]
  17. TUC, 01/08/2011 – Relevance over…
      [Figure: two panels, Crawled Documents and Crawl Time]
  18. TUC, 01/08/2011 – Comparison to Archive-It
      [Figure: accumulated relevance (0 to 15000) over number of documents (0 to 15000). Series: Web Archive Crawl, Archive-It Crawl]
  19. TUC, 01/08/2011 – Web Archive Contributions
      • web.archive.org: 75%
      • wayback.archive-it.org: 14%
      • webarchive.loc.gov: 7%
      • web.archive.bibalex.org: 2%
      • archive.is: 2%
  20. Takeaways
      • Web archives are great resources to build event collections of web resources
      • Crawling web archives is much slower than the live web
      • Collections about very recent events benefit more from the live web than the archived web, but
      • Collections about events from the distant past benefit more from the archived web than the live web
      • Utilizing multiple web archives is beneficial for the collection
      • Focused crawls have the potential to outperform manual collection building
  21. https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384
