Presentation at IIPC 2016 conference, Reykjavik, Iceland, 14 April 2016. Abstract:
Web archiving institutions have jointly harvested petabytes of archived web content, potentially an exceptionally rich data source for researchers across the globe. These web archives are multidimensional by nature. First, a temporal dimension arises from different versions of web content accumulated over time. Second, a hierarchical dimension is implied, as web archives may be examined at different analytical levels (Brügger, 2010); examples include the level of the web sphere, the website and the web page.
Scholars often focus their analysis on a specific analytical level and temporal range, for example looking at electoral web spheres at election times (Xenos and Bennet, 2007) or hyperlinking in news websites across time (Karlsson et al, 2015). However, we claim that this scholarly practice is not well supported by current web archive access tools, which usually allow access only at the page level and do not offer insights into the temporal development of broader selections of archived Web content, such as web spheres or websites. Hence, there is a need for more flexible access services in a research context.
In this presentation, we conceptually and practically explore how to address this mismatch. We illustrate how the temporal dimension can be harnessed by aggregating web content using different time ranges and the hierarchical dimension accommodated by novel aggregation support. Utilizing a concrete use case, we illustrate the potential usefulness of these representations of aggregated Web content. We analyze and compare the temporal evolution of various categories of websites in the Dutch Web Archive (such as news, history-related and government websites) across a five-year period. In this analysis, we look at the evolution of textual content, internal structure and image content across categories and websites. Finally, our presentation indicates how these types of aggregated representations may be integrated into future search systems for Web archives.
Towards Multidimensional Web Archive Access (IIPC 2016)
1. Towards Multidimensional Web Archive Access
Creating & Analyzing Representations of Aggregated Web Content
Hugo Huurdeman, Thaer Samar, Jaap Kamps, Arjen de Vries
2. Introduction
• Web archives:
• exceptionally rich potential scholarly data source
• Important: temporal & hierarchical aspects
• however, current access usually at single page level
3. Introduction
• Focus: how can we provide insights into the multidimensional aspects of the archive?
• i.e. moving from singular representations of time-stamped pages to larger aggregated representations
• Illustrated by previous work on scholarly access & examples from Dutch Web archive
5. 1.1 Exploratory Study
• Exploratory analysis of scholars’ research tasks (journal papers) [see: Huurdeman15]
• scholars using temporal Web data
• Focus on corpus generation, analysis and dissemination
6. 1.1 Exploratory Study
• Method:
• querying EBSCOhost using the CMMC (Communication & Mass Media Complete) and LISTA (Library, Information Science & Technology Abstracts) databases
• selecting all journal papers (2007-2015) which contain longitudinal analyses (excl. computer science papers)
7. 1.2 Results: Scholars’ Corpora
• Observation:
• Of the 18 resulting papers, most scholars did not use institutional Web archives as their data source
• Corpus definition:
• 1. by selecting webpages or websites, e.g. based on authoritative lists (13)
• 2. by querying regular search engines (5)
• 3. by taking a sample of webpages (4)
• or a combination thereof
8. 1.3 Results: Dimensions
• Some research examples:
• quality of answers in question-answering sites over time (Chua et al, 2013)
• hyperlinking in news websites across time (Karlsson et al, 2015)
• electoral web spheres at election times (Xenos & Bennet, 2007)
• Various hierarchical and temporal dimensions
9. 1.3.1 Results: Hierarchical Dimension
• Level of analysis (based on Brügger, 2013):
• page element (4) (22%)
• e.g. mission statements
• web page (6) (33%)
• e.g. blog pages
• web site* (7) (39%)
• e.g. political actors’ sites
• web sphere (1) (6%)
• e.g. electoral web sphere
[Chart: number of papers per level of analysis]
11. 1.3 Dimensions: Wrap-up
• Scholars’ focus: not just on pages, but also on page elements, web sites and web spheres
• at timepoints, singular timerange, multiple timeranges
• Various ways to define a corpus
• queries, samples and selections (e.g. URL lists)
• How are these needs reflected in Web archive data and access functionality?
16. 2.2 Data: Dimensions
• (2) hierarchical dimension
• “web sphere, web site, web page, page element”
• stored in (W)ARC files
• as “flat” (W)ARC records
17. [Diagram: web sphere > website > page > element]
• Web sphere: e.g. set of websites; category of sites
• Website: e.g. all pages under a host or domain; all homepages; all homepages+1
• Page
• Element: e.g. .css, .jpg file
• Issue: delineating the granularities
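As an illustration of the granularity issue, the following minimal Python sketch regroups flat records into a website > page/element hierarchy. The record list, the `build_hierarchy` function and the rule "HTML is a page, everything else is an element" are assumptions made for this example only, not the approach used in the presentation. Elements are attached to the website rather than to an individual page, since linking them to pages would require parsing the HTML.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical flat records (URL, MIME type), as might be read from (W)ARC headers.
records = [
    ("http://www.example.nl/", "text/html"),
    ("http://www.example.nl/style.css", "text/css"),
    ("http://www.example.nl/news/", "text/html"),
    ("http://www.example.org/", "text/html"),
    ("http://www.example.org/logo.jpg", "image/jpeg"),
]


def build_hierarchy(flat_records):
    """Group flat records by host (website level) and split them into pages and elements."""
    sites = defaultdict(lambda: {"pages": [], "elements": []})
    for url, mime in flat_records:
        host = urlsplit(url).hostname
        # Assumption: HTML documents count as pages, all other resources as elements.
        kind = "pages" if mime == "text/html" else "elements"
        sites[host][kind].append(url)
    return dict(sites)


hierarchy = build_hierarchy(records)
for host, content in hierarchy.items():
    print(host, len(content["pages"]), "pages,", len(content["elements"]), "elements")
```

A web sphere could then be modelled as a named set of such hosts, for example a category from a seed list.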
18. 2.3 Access: current limits
• Open question: how to support these dimensions?
• current support in interfaces:
• most: Selecting URLs, timestamps (Wayback Machine)
• many: Querying contents of the archive, temporal filters
• few: Selecting categories, facet filters
• usually still page-level results, i.e. individual pages
• How to provide aggregated results using different hierarchical and temporal dimensions?
• scaling from page to site and ‘sphere’ level
• moving from single timestamp to time periods
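One way to picture the move from single timestamps to time periods is to count captures per site and per period. The sketch below assumes captures are available as pairs of a 14-digit `YYYYMMDDhhmmss` timestamp and a URL; the sample data and the function names `period_key` and `count_captures` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical captures: (14-digit timestamp, URL).
captures = [
    ("20120314120000", "http://www.example.nl/"),
    ("20120320120000", "http://www.example.nl/news/"),
    ("20130101120000", "http://www.example.nl/"),
    ("20130102120000", "http://www.example.org/"),
]

# Number of leading timestamp digits that identify each period.
PREFIX_LENGTH = {"year": 4, "month": 6, "day": 8}


def period_key(timestamp, granularity):
    """Truncate a timestamp to the requested temporal granularity."""
    return timestamp[:PREFIX_LENGTH[granularity]]


def count_captures(rows, granularity="year"):
    """Count captures per (site, period) instead of per individual page."""
    counts = Counter()
    for timestamp, url in rows:
        site = urlsplit(url).hostname
        counts[(site, period_key(timestamp, granularity))] += 1
    return counts


yearly = count_captures(captures, "year")
monthly = count_captures(captures, "month")
print(yearly)
print(monthly)
```

Changing the `granularity` argument switches the temporal dimension; grouping by a set of hosts instead of a single host would scale the same count to the 'sphere' level.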
22. 3.1 Data: extraction and processing
• extracting all homepages + all pages 1 level deep
• matching with seedlist
• adding KB metadata
• cleaning, processing, data enrichment (e.g. NER)
• generate aggregations (~900K XML files)
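The first two steps of this pipeline can be sketched roughly as follows. This is a simplified illustration, not the actual processing code: it assumes that "1 level deep" can be approximated by URL path depth (the original selection may instead have been based on link distance), and the seed list, URLs and function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical seed list of hosts selected by the archive.
seedlist = {"www.example.nl", "www.example.org"}

urls = [
    "http://www.example.nl/",
    "http://www.example.nl/news",
    "http://www.example.nl/news/2016/item.html",
    "http://www.other.com/",
]


def path_depth(url):
    """Number of non-empty path segments; a homepage has depth 0."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def select_pages(candidates, seeds, max_depth=1):
    """Keep homepages and pages up to max_depth levels deep whose host is on the seed list."""
    selected = []
    for url in candidates:
        host = urlsplit(url).hostname
        if host in seeds and path_depth(url) <= max_depth:
            selected.append(url)
    return selected


selected = select_pages(urls, seedlist)
print(selected)
```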
34. 3.2.2 Exploring Content Summaries
• Examine textual contents of a website
• for example, nu.nl
• most popular Dutch news site (Alexa, 2016)
• daily crawls by KB
• Exploration: different temporal site-level summarizations
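A temporal site-level summarization could, in its simplest form, be a list of the most frequent terms per period. The sketch below shows that idea under strong simplifying assumptions: the page texts are invented, tokenization is a plain regular expression, and no stop-word removal or other cleaning is applied. It is not the summarization method used in the presentation.

```python
import re
from collections import Counter, defaultdict

# Hypothetical extracted page texts for one site, labelled with their crawl year.
pages = [
    ("2012", "verkiezingen nieuws verkiezingen"),
    ("2012", "nieuws verkiezingen sport"),
    ("2013", "sport sport weer"),
]


def summarize(page_texts, top_n=2):
    """Return the top_n most frequent terms per period across all pages of the site."""
    by_period = defaultdict(Counter)
    for period, text in page_texts:
        # Lowercase and split on word characters; a real pipeline would clean further.
        by_period[period].update(re.findall(r"\w+", text.lower()))
    return {period: counter.most_common(top_n) for period, counter in by_period.items()}


summary = summarize(pages)
for period, terms in summary.items():
    print(period, terms)
```

Using months or days as the period label instead of years would give finer-grained summaries of the same site.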
41. 4.1 Conclusion
• Gap between researchers’ needs and data/access
• Researchers’ needs
• rich access, e.g. different analytical levels, temporal ranges
• Archive access
• mainly access at single page level (URLs and queries)
• Calls for new approaches to provide access to aggregated contents
• temporally and hierarchically
42. 4.2 Our approach
• Starting from a selection instead of a query
• Potential support for exploratory stages of (re)search
• Potential support for analysis and comparisons
• Issues: which levels of a website to summarize
• experimental focus on homepages and underlying pages
• deeper layers: additional richness, additional issues
• custom file formats vs standardized formats
• Integration into access interfaces
43. 4.3 Ongoing and Future Work
• Further extending our approach; integration into WebARTist toolset
• providing new ways to explore material in the archive (without using queries)
• Creating aggregated representations of unarchived contents
• see “Lost but Not Forgotten: Finding Pages on the Unarchived Web” (2015)
[Diagram labels: Web Archive; “Corpus Creation”, “Analysis”, “Dissemination”]
44. References
• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1).
• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.
• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015)
• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.
• Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099
• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.
• Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at Web Archives as Scholarly Sources conference, Aarhus University, Denmark.
• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.
• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.
• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.
• Rogers, R. (2013). Digital Methods. MIT Press.
45. Thanks & Acknowledgements
• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Sanna Kumpulainen
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
48. Towards Multidimensional Web Archive Access
Creating & Analyzing Representations of Aggregated Web Content
Hugo Huurdeman, Thaer Samar, Jaap Kamps, Arjen de Vries
huurdeman@uva.nl
@timelessfuture