
Towards Multidimensional Web Archive Access (IIPC 2016)

Presentation at IIPC 2016 conference, Reykjavik, Iceland, 14 April 2016. Abstract:
Web archiving institutions have jointly harvested petabytes of archived web content, potentially an exceptionally rich data source for researchers across the globe. These web archives are multidimensional by nature. First, a temporal dimension arises from different versions of web content accumulated over time. Second, a hierarchical dimension is implied, as web archives may be examined at different analytical levels (Brügger, 2010), such as the web sphere, the website and the web page.
Scholars often focus their analysis on a specific analytical level and temporal range, for example looking at electoral web spheres at election times (Xenos and Bennet, 2007) or hyperlinking in news websites across time (Karlsson et al, 2015). However, we claim that this scholarly practice is not well supported by current web archive access tools, which usually allow access only at the page level and offer no insight into the temporal development of broader selections of archived Web content, such as web spheres or websites. Hence, there is a need for more flexible access services in a research context.
In this presentation, we conceptually and practically explore how to address this mismatch. We illustrate how the temporal dimension can be harnessed by aggregating web content using different time ranges and the hierarchical dimension accommodated by novel aggregation support. Utilizing a concrete use case, we illustrate the potential usefulness of these representations of aggregated Web content. We analyze and compare the temporal evolution of various categories of websites in the Dutch Web Archive (such as news, history-related and government websites) across a five-year period. In this analysis, we look at the evolution of textual content, internal structure and image content across categories and websites. Finally, our presentation indicates how these types of aggregated representations may be integrated into future search systems for Web archives.


  1. 1. Towards Multidimensional Web Archive Access: Creating & Analyzing Representations of Aggregated Web Content • Hugo Huurdeman, Thaer Samar, Jaap Kamps, Arjen de Vries
  2. 2. Introduction • Web archives: • exceptionally rich potential scholarly data source • Important: temporal & hierarchical aspects • however, current access usually at single page level
  3. 3. Introduction • Focus: how can we provide insights into the multidimensional aspects of the archive? • i.e. moving from singular representations of time-stamped pages to larger aggregated representations • Illustrated by previous work on scholarly access & examples from the Dutch Web archive
  4. 4. 1. Scholars’ Needs: literature analysis
  5. 5. 1.1 Exploratory Study • Exploratory analysis of scholars’ research tasks (journal papers) [see: Huurdeman15] • scholars using temporal Web data • Focus on corpus generation, analysis and dissemination
  6. 6. 1.1 Exploratory Study • Method: • querying EBSCOhost using the CMMC (Communication & Mass Media Complete) and LISTA (Library, Information Science & Technology Abstracts) databases • selecting all journal papers (2007-2015) which contain longitudinal analyses (excl. computer science papers)
  7. 7. 1.2 Results: Scholars’ Corpora • Observation: • Of the 18 resulting papers, most scholars did not use institutional Web archives as their data source • Corpus definition: • 1. by selecting webpages or websites, e.g. based on authoritative lists (13) • 2. by querying regular search engines (5) • 3. by taking a sample of webpages (4) • or a combination thereof
  8. 8. 1.3 Results: Dimensions • Some research examples: • quality of answers in question-answering sites over time (Chua et al, 2013) • hyperlinking in news websites across time (Karlsson et al, 2015) • electoral web spheres at election times (Xenos & Bennet, 2007) • Various hierarchical and temporal dimensions
  9. 9. 1.3.1 Results: Hierarchical Dimension • Level of analysis (based on Brügger, 2013): • page element (4) (22%) • e.g. mission statements • web page (6) (33%) • e.g. blog pages • web site* (7) (39%) • e.g. political actors’ sites • web sphere (1) (6%) • e.g. electoral web sphere
  10. 10. 1.3.2 Results: Temporal Dimension • Number of papers per temporal scope (timeline shown: 2000, 2005, 2010): • timepoints: 5 (28%) • singular timerange: 8 (44%) • multiple timeranges: 5 (28%)
  11. 11. 1.3 Dimensions: Wrap-up • Scholars’ focus: not just on pages, but also on page elements, web sites and web spheres • at timepoints, singular timerange, multiple timeranges • Various ways to define a corpus • queries, samples and selections (e.g. URL lists) • How are these needs reflected in Web archive data and access functionality?
  12. 12. 2. Dimensions of the Web archive: data and access
  13. 13. 2.1 Web Archive Data • Usually stored in (W)ARC files • each containing one or more (W)ARC records • resources of various kinds
  14. 14. 2.1 Data: Dimensions • (1) temporal dimension • versions of Web content accumulated over time • timestamped (W)ARC records • crawl dates • last-modified dates
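The temporal dimension described on this slide can be made concrete by grouping captures into time ranges. The sketch below is only an illustration: it assumes capture timestamps are digit strings starting with `YYYYMMDD` (as in the harvest dates used elsewhere in these slides), and the helper names and sample captures are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def period_key(timestamp: str, granularity: str) -> str:
    """Map a capture timestamp to a year, quarter or month label."""
    # Assumption: the first eight characters are YYYYMMDD.
    moment = datetime.strptime(timestamp[:8], "%Y%m%d")
    if granularity == "year":
        return f"{moment.year}"
    if granularity == "quarter":
        return f"{moment.year}Q{(moment.month - 1) // 3 + 1}"
    if granularity == "month":
        return f"{moment.year}-{moment.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def group_by_period(captures, granularity):
    """Group (url, timestamp) pairs by the chosen time range."""
    groups = defaultdict(list)
    for url, timestamp in captures:
        groups[period_key(timestamp, granularity)].append(url)
    return dict(groups)


# Invented sample captures, for illustration only.
captures = [
    ("http://example.nl/", "20100722120000"),
    ("http://example.nl/", "20100816120000"),
    ("http://example.nl/", "20110413120000"),
]
by_quarter = group_by_period(captures, "quarter")
by_year = group_by_period(captures, "year")
```

Switching the `granularity` argument is all that is needed to move between a yearly, quarterly or monthly view of the same captures.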
  15. 15. [Figure: timeline with markers at 1997, 2000, 2004, 2008, 2012 and 2016]
  16. 16. 2.2 Data: Dimensions • (2) hierarchical dimension • “web sphere, web site, web page, page element” • stored in (W)ARC files • as “flat” (W)ARC records
  17. 17. [Figure: tree of granularities: a web sphere contains websites, each website contains pages, each page contains page elements] • Web sphere: e.g. set of websites; category of sites • Website: e.g. all pages under a host or domain; all homepages; all homepages+1 • Page element: e.g. .css, .jpg file • Issue: delineating the granularities
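One way to think about delineating these granularities is to derive them from the URL of each flat record. The following is a rough sketch under loud assumptions: "one level deep" is approximated by URL path depth (the slides may well mean link depth), the list of page-element extensions is invented, and `site_of` and `level_of` are hypothetical helper names.

```python
from urllib.parse import urlsplit

# Assumption: these extensions mark embedded page elements.
ELEMENT_EXTENSIONS = (".css", ".jpg", ".jpeg", ".png", ".gif", ".js")


def site_of(url: str) -> str:
    """Return the host without a leading 'www.' as the website key."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def level_of(url: str) -> str:
    """Classify a URL as page element, homepage, homepage+1 or deeper page."""
    path = urlsplit(url).path.lower()
    if path.endswith(ELEMENT_EXTENSIONS):
        return "page element"
    # Assumption: path depth stands in for "levels below the homepage".
    depth = len([segment for segment in path.split("/") if segment])
    if depth == 0:
        return "homepage"
    if depth == 1:
        return "homepage+1"
    return "deeper page"


levels = {
    url: level_of(url)
    for url in (
        "http://www.eyefilm.nl/",
        "http://www.eyefilm.nl/collectie",
        "http://www.eyefilm.nl/static/logo.jpg",
    )
}
```

A web sphere would then be a set of `site_of` keys, for example all sites sharing a category in a seedlist.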
  18. 18. 2.3 Access: current limits • Open question: how to support these dimensions? • current support in interfaces: • most: Selecting URLs, timestamps (Wayback Machine) • many: Querying contents of the archive, temporal filters • few: Selecting categories, facet filters • usually still page-level results, i.e. individual pages • How to provide aggregated results using different hierarchical and temporal dimensions? • scaling from page to site and ‘sphere’ level • moving from single timestamp to time periods
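As a minimal illustration of moving from page-level results to aggregated ones, page-level hits could be counted per site and per year instead of being listed individually. The hit list, the `aggregate_hits` helper and the choice of host and calendar year as aggregation keys are assumptions made for this sketch only.

```python
from collections import Counter
from urllib.parse import urlsplit


def aggregate_hits(hits):
    """Count page-level hits per (host, year) pair."""
    counts = Counter()
    for url, timestamp in hits:
        host = urlsplit(url).hostname or ""
        # Assumption: timestamps start with a four-digit year.
        counts[(host, timestamp[:4])] += 1
    return counts


# Invented page-level results, for illustration only.
hits = [
    ("http://nu.nl/a", "20130105"),
    ("http://nu.nl/b", "20130611"),
    ("http://nu.nl/a", "20140105"),
    ("http://eyefilm.nl/", "20130218"),
]
site_year = aggregate_hits(hits)
```

The same counts can be read along either dimension: fix the site to see change over time, or fix the year to compare sites.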
  19. 19. [Figure: hierarchical levels (web sphere, web site, web page, page element) set against a time axis (2000, 2005, 2010)]
  20. 20. 3. Exploring Aggregations: aggregated representations in the Dutch Web archive
  21. 21. National Library of the Netherlands: Web archive since 2007 • Statistics: • 10,000+ websites • 35,000+ harvests • 16+ terabytes • (Image: Flickr, koninklijkebibliotheek)
  22. 22. 3.1 Data: extraction and processing • extracting all homepages + all pages 1 level deep • matching with seedlist • adding KB metadata • cleaning, processing, data enrichment (e.g. NER) • generating aggregations (~900K XML files)
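The later steps of this pipeline (matching with the seedlist, adding metadata, generating an aggregation file) might look roughly like the sketch below. The slides do not specify the actual file format, so the XML element and attribute names, the `build_site_summary` helper, the seedlist structure and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_site_summary(site, pages, seedlist):
    """Return one XML document aggregating a site's pages, or None if unseeded."""
    if site not in seedlist:
        return None  # site does not match the seedlist, so it is skipped
    # Hypothetical layout: one <site> element with one <page> child per capture.
    root = ET.Element("site", name=site, category=seedlist[site])
    for page in pages:
        node = ET.SubElement(
            root, "page", url=page["url"], harvested=page["harvested"]
        )
        node.text = page["text"]
    return ET.tostring(root, encoding="unicode")


# Invented seedlist entry and page texts, for illustration only.
seedlist = {"eyefilm.nl": "film"}
pages = [
    {"url": "http://eyefilm.nl/", "harvested": "20100722", "text": "Welcome"},
    {"url": "http://eyefilm.nl/collectie", "harvested": "20100722", "text": "Collection"},
]
summary_xml = build_site_summary("eyefilm.nl", pages, seedlist)
```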
  23. 23. [Figure: single page]
  24. 24. [Figure: site summary alongside single pages]
  25. 25. 3.2 Potential Use: Explorations • Potential for analysis and visualization • Examples via Dutch Web archive • I. (aggregated) degree of change (hierarchical) • homepages+1, ssdeep (content text, links, images) • II. (aggregated) content summaries (temporal) • homepages+1, tf-idf
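The degree-of-change idea can be sketched by comparing two consecutive harvests of the same site facet by facet. The slides name ssdeep (fuzzy hashing) for this; the sketch below instead substitutes `difflib.SequenceMatcher` from the Python standard library, so its scores are not comparable to ssdeep scores. The facet names follow the slide, while the snapshot structure, the sample values and the use of a plain mean for the overall score are assumptions.

```python
from difflib import SequenceMatcher

FACETS = ("content", "links", "images")


def change_degree(old: str, new: str) -> float:
    """Return 0 for identical strings, up to 100 for nothing in common."""
    # Stand-in for an ssdeep comparison: plain sequence similarity.
    similarity = SequenceMatcher(None, old, new).ratio()
    return round(100 * (1 - similarity), 1)


def snapshot_change(previous: dict, current: dict) -> dict:
    """Score the change per facet between two harvests of one site."""
    scores = {
        facet: change_degree(previous[facet], current[facet]) for facet in FACETS
    }
    # Assumption: the overall score is the unweighted mean of the facets.
    scores["overall"] = round(sum(scores[f] for f in FACETS) / len(FACETS), 1)
    return scores


# Invented harvests, for illustration only.
before = {
    "content": "welcome to the museum",
    "links": "/home /visit /shop",
    "images": "logo.png banner.jpg",
}
after = {
    "content": "welcome to the museum",
    "links": "/home /visit /shop",
    "images": "logo.png",
}
scores = snapshot_change(before, after)
```

In this toy case only the image facet changes, so the overall score is pulled up by images alone; a site redesign would typically raise all three facets at once.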
  26. 26. 3.2.1 Examining aggregated degree of change
  27. 27. [Figure: hierarchical levels (web sphere, web site, web page, page element) set against a time axis (2010 to 2015); example: eyefilm.nl]
  28. 28. [Chart: Example: eyefilm.nl (2010-2015). Degree of change per harvest, 22 July 2010 to 18 May 2015; series: content, links, images, overall; two redesigns marked]
  29. 29. [Chart: Example: escherinhetpaleis.nl (2010-2015). Degree of change per harvest; series: content, links, images, overall; shown together with the eyefilm.nl chart]
  30. 30. [Figure: hierarchical levels (web sphere, web site, web page, page element) set against a time axis (2010 to 2015); example: UNESCO classifications]
  31. 31. Changerate (type of site) [Chart: changes per UNESCO category (all per-quarter harvests, n=~600, 2009-2015); series: average change in content, images, links and combined; labelled categories: Meteorology, Law & government, History, Sports, Agriculture]
  32. 32. Changerate (all sites) [Chart: changerate per quarter, 2009Q3 to 2015Q1 (all per-quarter harvests, 2009-2015); series: average change in content, links, images and combined]
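The per-category and per-quarter charts both report averages of change scores, which amounts to grouping records by one key and taking the mean. A small sketch follows, with an invented record layout, invented category labels and invented numbers; none of these values come from the Dutch Web archive.

```python
from collections import defaultdict
from statistics import mean


def average_change(records, key):
    """Average the change score of all records that share the same key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["change"])
    return {group: round(mean(values), 1) for group, values in groups.items()}


# Invented change scores, for illustration only.
records = [
    {"category": "news", "quarter": "2010Q1", "change": 40.0},
    {"category": "news", "quarter": "2010Q2", "change": 50.0},
    {"category": "history", "quarter": "2010Q1", "change": 10.0},
    {"category": "history", "quarter": "2010Q2", "change": 20.0},
]
per_category = average_change(records, "category")
per_quarter = average_change(records, "quarter")
```

Grouping by `"category"` gives the type-of-site view, and grouping by `"quarter"` gives the all-sites view over time.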
  33. 33. 3.2.2 Examining aggregated content summaries
  34. 34. 3.2.2 Exploring Content Summaries • Examine textual contents of a website • for example, nu.nl • most popular Dutch news site (Alexa, 2016) • daily crawls by KB • Exploration: different temporal site-level summarizations
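A temporal site-level summary of this kind can be approximated by treating all text harvested in one period as a single document and ranking its terms by tf-idf against the other periods. The sketch below uses a plain-Python tf-idf with whitespace tokenization; the `top_terms` helper, the weighting variant and the toy texts are assumptions and do not reproduce the processing behind the slides.

```python
import math
from collections import Counter


def top_terms(period_texts, k=3):
    """Return the k highest-scoring tf-idf terms for each period."""
    tokenized = {period: text.lower().split() for period, text in period_texts.items()}
    n_periods = len(tokenized)
    # Document frequency: in how many periods does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    summaries = {}
    for period, tokens in tokenized.items():
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_periods / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        summaries[period] = ranked[:k]
    return summaries


# Invented toy texts, not taken from nu.nl.
period_texts = {
    "2013": "election cabinet election news",
    "2014": "football cup football news",
    "2015": "weather storm weather news",
}
summaries = top_terms(period_texts, k=2)
```

Terms that occur in every period, such as "news" in the toy data, receive a score of zero and drop out of the summary, which is what makes the per-period terms distinctive. Passing monthly or daily texts instead of yearly ones changes the temporal granularity without changing the code.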
  35. 35. [Figure: yearly summaries, 2014 and 2015]
  36. 36. [Figure: monthly summaries, Jan ’13 to Dec ’13]
  37. 37. [Figure: daily summaries (2012)]
  38. 38. [Figure: named entities (NER) per year: organizations, persons and places; years shown between 2012 and 2015]
  39. 39. 3.2.3 Next: combining approaches [Chart: degree of change per harvest, 22 July 2010 to 18 May 2015, four series]
  40. 40. 4. Conclusion: Towards Multidimensional Web Archive Access
  41. 41. 4.1 Conclusion • Gap between researchers’ needs and data/access • Researchers’ needs • rich access, e.g. different analytical levels, temporal ranges • Archive access • mainly access at single page level (URLs and queries) • Calls for new approaches to provide access to aggregated contents • temporally and hierarchically
  42. 42. 4.2 Our approach • Starting from a selection instead of a query • Potential support for exploratory stages of (re)search • Potential support for analysis and comparisons • Issues: which levels of a website to summarize • experimental focus on homepages and underlying pages • deeper layers: additional richness, additional issues • custom file formats vs standardized formats • Integration into access interfaces
  43. 43. 4.3 Ongoing and Future Work • Further extending our approach; integration into WebARTist toolset • providing new ways to explore material in the archive (without using queries) • Creating aggregated representations of unarchived contents • see “Lost but Not Forgotten: Finding Pages on the Unarchived Web” (2015) [Figure: Web Archive, with the stages “Corpus Creation”, “Analysis” and “Dissemination”]
  44. 44. References • Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1). • Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321. • Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015). • Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273. • Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099 • Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127. • Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at Web Archives as Scholarly Sources conference, Aarhus University, Denmark. • Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries. • Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM. • Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587. • Rogers, R. (2013). Digital Methods. MIT Press.
  45. 45. Thanks & Acknowledgements • The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Sanna Kumpulainen • We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. • This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
  46. 46. webarchiving.nl @webart12
  47. 47. Towards Multidimensional Web Archive Access: Creating & Analyzing Representations of Aggregated Web Content • Hugo Huurdeman, Thaer Samar, Jaap Kamps, Arjen de Vries • huurdeman@uva.nl • @timelessfuture
