Focused Crawl of Web Archives to Build Event Collections

Presentation at WebSci 2018
https://doi.org/10.1145/3201064.3201085

Published in: Internet

Focused Crawl of Web Archives to Build Event Collections

  1. Focused Crawl of Web Archives to Build Event Collections
     Martin Klein, Lyudmila Balakireva, Herbert Van de Sompel
     Research Library, Los Alamos National Laboratory
     @mart1nkle1n, WebSci 2018, 05/30/2018, Amsterdam, NL
  2. Background – Event Collection Building
     • Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
     • Potentially with guidance from institutional collection policy
     • Results in a list of seeds (URIs, social media accounts, etc.)
     • Utilization of crawling services such as Archive-It, Social Feed Manager
  3. Problems and our Approach
     • Temporal: time passed since event is of concern → Use of web archives
     • Selection: seeds often picked manually → Use of references from Wikipedia pages
     • Relevance: seed assessment often done by humans → Use of focused crawling with content and temporal relevance assessment
     Inspiration from: “Extracting Event-Centric Document Collections from Large-Scale Web Archives”, Gerhard Gossen, Elena Demidova, Thomas Risse, https://doi.org/10.1007/978-3-319-67008-9_10
  4. Background – Archived Web
     • Web archives are an invaluable resource for researchers, historians, journalists, etc.
     • Often broad in scope, large in scale, covering different temporal intervals
     • Makes discovery, access, and analysis difficult
  7. Questions
     • Can we create event collections by focused crawling online-available web archives?
     • How do event collections created from the archived web compare to those created from the live web?
     • How does the amount of time passed since the event affect the collections built from the live and the archived web?
     • How do event collections built from the archived web compare to manually curated collections?
  8. Experiment
     • Topics limited to terror attacks and mass shootings in the U.S.
     • From different times in the past
     • Focused crawl of:
       a) 22 archives, simultaneously, via Memento infrastructure
       b) the live web
     • Take content and temporal relevance into account, equally weighted
     • Use events’ Wikipedia page as input for focused crawler
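The archive crawl on this slide goes through the Memento infrastructure. As a rough illustration of what a datetime-negotiated lookup looks like under the Memento protocol (RFC 7089), the sketch below only builds the request URL and `Accept-Datetime` header and sends nothing. The aggregator TimeGate base URL and the helper name `build_timegate_request` are assumptions for illustration, not details taken from the slides.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed endpoint: the public Memento Time Travel aggregator TimeGate.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri, when):
    """Return (url, headers) for a Memento datetime negotiation (hypothetical helper)."""
    # Accept-Datetime carries an RFC 1123 date in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + uri, {"Accept-Datetime": accept_datetime}


url, headers = build_timegate_request(
    "http://www.cnn.com/", datetime(2011, 1, 8, tzinfo=timezone.utc)
)
print(url)
print(headers["Accept-Datetime"])
```

A TimeGate normally answers such a request with a redirect to the archived copy closest to the requested datetime, which is how a single request can draw on many archives at once.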
  9. Content Relevance
     1. Content of Wikipedia page + random 60% of page’s references
        • Generate topic vector (TF-IDF of 1-grams + 2-grams)
     2. Content of remaining 40% of Wikipedia page’s outlinks
        • Generate topic vector (TF-IDF of 1-grams + 2-grams)
     • Compute cosine similarity value between vectors 1 and 2
     • Run 10 times
     • Take average similarity value as content threshold
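The threshold procedure on this slide can be sketched in plain Python. This is a minimal illustration under several assumptions, not the authors' implementation: tokenisation is a whitespace split, IDF is computed over the Wikipedia text plus all reference texts with a smoothed formula, the "references" and "outlinks" are treated as one list, and the toy strings are made up.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Lower-cased 1-grams plus 2-grams from a whitespace tokenisation."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vector(docs, idf):
    """Sum term counts over the given docs and weight them by IDF."""
    tf = Counter()
    for doc in docs:
        tf.update(ngrams(doc))
    return {term: count * idf.get(term, 1.0) for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(wiki_text, reference_texts, runs=10, seed=0):
    """Average cosine similarity over `runs` random 60/40 splits of the references."""
    docs = [wiki_text] + list(reference_texts)
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    # Smoothed IDF (an assumption; the slide does not give the exact formula).
    idf = {t: math.log((1 + len(docs)) / (1 + n)) + 1.0 for t, n in df.items()}
    rng = random.Random(seed)
    sims = []
    for _ in range(runs):
        refs = list(reference_texts)
        rng.shuffle(refs)
        cut = round(0.6 * len(refs))
        vector_1 = tfidf_vector([wiki_text] + refs[:cut], idf)  # page + 60%
        vector_2 = tfidf_vector(refs[cut:], idf)                # remaining 40%
        sims.append(cosine(vector_1, vector_2))
    return sum(sims) / len(sims)


# Toy, made-up texts standing in for fetched page content.
WIKI = "shooting in tucson arizona in january"
REFS = [
    "tucson shooting news report",
    "arizona officials respond to tucson event",
    "vigil held after shooting",
    "suspect arrested after tucson shooting",
    "weather forecast for arizona",
]
threshold = content_threshold(WIKI, REFS)
print(round(threshold, 3))
```

A crawled page would then count as content-relevant when its similarity to the topic vector reaches this threshold.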
  10. Temporal Relevance
     • Define temporal interval for which crawled pages are considered relevant
     • Event date extracted from Wikipedia event page
     • Change point determined from graph of proportional Wikipedia page edits per day
     [Figure: temporal relevance (0 to 1) over time, with Event Date, Change Point, and Today marked on the time axis]
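One way to read the interval idea on this slide is as a simple window score. The sketch below assumes a hard window (1 between the event date and the change point, 0 otherwise); the slide's figure may well show a smoother fall-off, so treat the shape as an assumption. The 50/50 combination follows the "equally weighted" note on the Experiment slide; the function names and the change-point date are hypothetical.

```python
from datetime import date


def temporal_relevance(published, event_date, change_point):
    """1.0 inside [event_date, change_point], else 0.0 (hard window: an assumption)."""
    return 1.0 if event_date <= published <= change_point else 0.0


def combined_relevance(content_score, temporal_score):
    """Equal weighting of content and temporal relevance."""
    return 0.5 * content_score + 0.5 * temporal_score


EVENT_DATE = date(2011, 1, 8)    # Tucson event date from the slides
CHANGE_POINT = date(2011, 2, 8)  # hypothetical change point, for illustration only

inside = temporal_relevance(date(2011, 1, 10), EVENT_DATE, CHANGE_POINT)
before = temporal_relevance(date(2010, 12, 31), EVENT_DATE, CHANGE_POINT)
print(inside, before, combined_relevance(0.8, inside))
```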
  11. Datetime Extraction
     • Extract datetime from pages via:
       • URI
         http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
       • Meta tags
         <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
       • ODU’s Carbondate tool
         http://carbondate.cs.odu.edu/
       • Memento datetime
       • X-Header
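The first two extraction routes on this slide (URI and meta tags) are easy to illustrate with the standard library. The sketch below is an assumption-laden example rather than the crawler's actual code: the URI pattern only covers `/YYYY/MM/DD/` paths, and the meta-tag check only looks at the two attributes shown in the slide's example. The helper names are hypothetical.

```python
import re
from datetime import date, datetime
from html.parser import HTMLParser

# Matches path segments such as /2017/12/09/ (assumed pattern).
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_uri(uri):
    """Return a date found in the URI path, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    try:
        return date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:  # e.g. month 13
        return None


class _MetaDateParser(HTMLParser):
    """Collects the content of the first publication-date meta tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("itemprop") == "datePublished"
                or attributes.get("property") == "article:published"):
            self.value = attributes.get("content")


def datetime_from_meta(html):
    """Return the datetime from a publication meta tag, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    parser.close()
    if not parser.value:
        return None
    try:
        return datetime.fromisoformat(parser.value)
    except ValueError:
        return None


uri_date = date_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/")
meta_datetime = datetime_from_meta(
    '<meta property="article:published" itemprop="datePublished" '
    'content="2017-12-09T10:14:50-05:00" />'
)
print(uri_date, meta_datetime)
```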
  12. Crawls
     • Use version of Wikipedia page that was live at change point
     • Crawl stop conditions:
       • No more relevant documents left
       • 5 levels deep
     • Utilized crawl priority queue
     [Figure: crawl tree with the Seed at Level 0, Child 1 to Child 3 at Level 1, and their children (Child 1.1, 1.2, 2.1, 3.1, 3.2) at Level 2]
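The crawl logic on this slide (priority queue, depth limit, stop when nothing relevant is left) can be sketched over a toy in-memory link graph. This is a simplified illustration, not the authors' crawler: the seed is always treated as relevant, a child inherits its parent's score as queue priority, and `LINKS`, `SCORES`, and the threshold are made-up values.

```python
import heapq


def focused_crawl(seed, get_links, score, threshold, max_depth=5):
    """Return relevant URIs in visit order, expanding only relevant pages."""
    queue = [(-1.0, 0, seed)]  # (negated priority, depth, uri)
    seen = {seed}
    relevant = []
    while queue:  # an empty queue means no relevant documents are left
        _, depth, uri = heapq.heappop(queue)
        page_score = score(uri)
        if depth > 0 and page_score < threshold:
            continue  # irrelevant page: do not follow its links
        relevant.append(uri)
        if depth == max_depth:
            continue  # depth limit reached
        for child in get_links(uri):
            if child not in seen:
                seen.add(child)
                # Assumption: a child is prioritised by its parent's score.
                heapq.heappush(queue, (-page_score, depth + 1, child))
    return relevant


# Toy link graph and relevance scores (hypothetical values).
LINKS = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
SCORES = {"seed": 1.0, "a": 0.9, "b": 0.2, "a1": 0.7, "b1": 0.8}

result = focused_crawl("seed", LINKS.__getitem__, SCORES.__getitem__, threshold=0.5)
print(result)
```

In this toy run, page `b1` is never visited even though its own score is high, because its only parent `b` falls below the threshold. That is the trade-off a focused crawl accepts in exchange for not following irrelevant branches.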
  13. Collections
     • New York City, October 31st 2017
     • San Bernardino, December 2nd 2015
     • Tucson, January 8th 2011
     • Binghamton, April 3rd 2009
     Crawled (in November 2017)
  14. NYC, 10/31/2017 – URIs per Level
     [Figure: two panels, Web Archive Crawl and Live Web Crawl. X-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100). Series: All URIs, Relevant URIs.]
  15. TUC, 01/08/2011 – URIs per Level
     [Figure: two panels, Web Archive Crawl and Live Web Crawl. X-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100). Series: All URIs, Relevant URIs.]
  16. NYC, 10/31/2017 – Relevance over…
     [Figure: two panels, Crawled Documents and Crawl Time]
  17. TUC, 01/08/2011 – Relevance over…
     [Figure: two panels, Crawled Documents and Crawl Time]
  18. TUC, 01/08/2011 – Comparison to Archive-It
     [Figure: accumulated relevance (0 to 15000) over documents (0 to 15000). Series: Web Archive Crawl, Archive-It Crawl.]
  19. TUC, 01/08/2011 – Web Archive Contributions
     • web.archive.org: 75%
     • wayback.archive-it.org: 14%
     • webarchive.loc.gov: 7%
     • web.archive.bibalex.org: 2%
     • archive.is: 2%
  20. Takeaways
     • Web archives are great resources to build event collections of web resources
     • Crawling web archives is much slower than the live web
     • Collections about very recent events benefit more from the live web than the archived web, but
     • Collections about events from the distant past benefit more from the archived web than the live web
     • Utilizing multiple web archives is beneficial for the collection
     • Focused crawls have the potential to outperform manual collection building
  21. https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384
