Towards Multidimensional Web Archive Access (IIPC 2016)

Towards Multidimensional Web Archive Access  
 
Creating & Analyzing Representations of Aggregated Web Content
Hugo Huurdeman, Thaer Samar, Jaap Kamps, Arjen de Vries

Introduction
• Web archives:
• exceptionally rich potential scholarly data source
• Important: temporal & hierarchical aspects
• however, current access usually at single page level

Introduction
• Focus: how can we provide insights
into the multidimensional aspects of
the archive?
• i.e. moving from singular representations of
time-stamped pages to larger aggregated
representations
• Illustrated by previous work on scholarly
access & examples from Dutch Web
archive

1. Scholars’ Needs
literature analysis

1.1 Exploratory Study
• Exploratory analysis of
scholars’ research tasks
(journal papers)  
[see: Huurdeman15]
• scholars using temporal
Web data
• Focus on corpus generation, analysis
and dissemination

1.1 Exploratory Study
• Method:
• querying EBSCOhost using the CMMC
(Communication & Mass Media Complete), and LISTA
(Library, Information Science & Technology Abstracts)
databases
• selecting all journal papers (2007-2015) which contain
longitudinal analyses (excl. computer science papers)

1.2 Results: Scholars’ Corpora
• Observation:
• Of the 18 resulting papers, most scholars did
not use institutional Web archives as their data
source
• Corpus definition:
• 1. by selecting webpages or websites, e.g. based
on authoritative lists (13)
• 2. by querying regular search engines (5)
• 3. by taking a sample of webpages (4)
• or a combination thereof

1.3 Results: Dimensions
• Some research examples:
• quality of answers in question-answering sites over time
(Chua et al., 2013)
• hyperlinking in news websites across time (Karlsson et al.,
2015)
• electoral web spheres at election times (Xenos & Bennett,
2007)
• Various hierarchical and temporal dimensions

1.3.1 Results: Hierarchical Dimension
• Level of analysis: 
(based on Brügger, 2013)
• page element (4) (22%)
• e.g. mission statements
• web page (6) (33%)
• e.g. blog pages
• web site* (7) (39%)
• e.g. political actors’ sites
• web sphere (1) (6%)
• e.g. electoral web sphere

1.3.2 Results: Temporal Dimension
• timepoints: 5 papers (28%)
• singular timerange: 8 papers (44%)
• multiple timeranges: 5 papers (28%)
[Figure: the three temporal scopes shown on a timeline, 2000-2010]

1.3 Dimensions: Wrap-up
• Scholars’ focus: not just on pages, but also on
page elements, web sites and web spheres
• at timepoints, singular timerange, multiple timeranges
• Various ways to define a corpus
• queries, samples and selections (e.g. URL lists)
• How are these needs reflected in Web archive
data and access functionality?

2. Dimensions of the Web archive
data and access

2.1 Web Archive Data
• Usually stored in (W)ARC files
• each containing one or more (W)ARC records
• resources of various kinds

2.1 Data: Dimensions
• (1) temporal dimension
• versions of Web content accumulated over time
• timestamped (W)ARC records
• crawl dates
• last-modified dates

[Figure: timeline of accumulated versions of Web content, 1997-2016]
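
The temporal dimension becomes usable once each record's timestamp is parsed. A minimal sketch, assuming 14-digit `YYYYMMDDhhmmss` capture timestamps of the kind found in ARC and CDX index lines (WARC headers carry an ISO 8601 `WARC-Date` instead). The URLs and the `captures` list are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

def parse_capture_timestamp(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

# Hypothetical (url, timestamp) pairs; not taken from the Dutch Web archive.
captures = [
    ("http://example.nl/", "20080115120000"),
    ("http://example.nl/", "20120302083000"),
    ("http://example.nl/", "20120911101500"),
]

# Collect the versions of each resource per crawl year.
versions_per_year = defaultdict(list)
for url, ts in captures:
    versions_per_year[parse_capture_timestamp(ts).year].append(url)

for year in sorted(versions_per_year):
    print(year, len(versions_per_year[year]))
```

Crawl dates and last-modified dates can differ, so which of the two is parsed here is a choice the analysis has to make explicitly.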

2.2 Data: Dimensions
• (2) hierarchical dimension
• “web sphere, web site, web page, page element”
• stored in (W)ARC files
• as “flat” (W)ARC records

[Figure: tree of granularities: a web sphere contains websites, each website contains pages, each page contains elements]
• Web sphere: e.g., set of websites; category of sites
• Website: e.g., all pages under a host or domain; all homepages; all homepages+1
• Page
• Element: e.g., .css, .jpg file
• Issue: delineating the granularities
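
One way to recover the hierarchy from "flat" records is to derive a site key and a level for each record. This is only a sketch under stated assumptions: the host name stands in for the website, HTML records count as pages, and every other MIME type counts as a page element. The slides do not prescribe these rules, and the `records` list is hypothetical:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def site_of(url: str) -> str:
    """Use the host name as the website key (an assumption; a registered
    domain or a curated seed list would be equally valid delineations)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def level_of(mime_type: str) -> str:
    """Treat HTML records as pages and everything else as page elements."""
    return "page" if mime_type.startswith("text/html") else "element"

# Hypothetical flat records: (url, MIME type).
records = [
    ("http://www.example.nl/", "text/html"),
    ("http://www.example.nl/about.html", "text/html; charset=utf-8"),
    ("http://www.example.nl/style.css", "text/css"),
    ("http://other.example.org/logo.jpg", "image/jpeg"),
]

# Group the flat records into site -> level -> URLs.
hierarchy = defaultdict(lambda: {"page": [], "element": []})
for url, mime in records:
    hierarchy[site_of(url)][level_of(mime)].append(url)

for site, levels in sorted(hierarchy.items()):
    print(site, len(levels["page"]), "pages,", len(levels["element"]), "elements")
```

A web sphere would then be a named set of such site keys, for example all sites in one category of a seed list.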

2.3 Access: current limits
• Open question: how to support these dimensions?
• current support in interfaces:
• most: Selecting URLs, timestamps (Wayback Machine)
• many: Querying contents of the archive, temporal filters
• few: Selecting categories, facet filters
• usually still page-level results, i.e. individual pages
• How to provide aggregated results using different
hierarchical and temporal dimensions?
• scaling from page to site and ‘sphere’ level
• moving from single timestamp to time periods

[Figure: dimension grid with hierarchical levels (web sphere, web site, web page, page element) against time, 2000-2010]
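
Scaling from page-level results to site level and from single timestamps to time periods can be sketched as a roll-up over (site, period) cells. The quarter labels follow the `2009Q3` style used later in the slides; the `hits` list and the use of the host name as site key are assumptions for illustration:

```python
from collections import Counter
from urllib.parse import urlsplit

def quarter_of(ts: str) -> str:
    """Map a YYYYMMDD... timestamp to a label such as '2009Q3'."""
    year, month = int(ts[:4]), int(ts[4:6])
    return f"{year}Q{(month - 1) // 3 + 1}"

# Hypothetical page-level hits: (url, capture timestamp).
hits = [
    ("http://example.nl/", "20090815"),
    ("http://example.nl/news", "20090902"),
    ("http://example.nl/", "20091120"),
    ("http://example.org/", "20090701"),
]

# Roll the page-level hits up to (site, quarter) cells.
cells = Counter((urlsplit(url).hostname, quarter_of(ts)) for url, ts in hits)

for (site, quarter), n in sorted(cells.items()):
    print(site, quarter, n)
```

The same pattern extends to the sphere level by replacing the host name with a category label.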

3. Exploring Aggregations
Aggregated representations in the Dutch Web archive

Flickr: koninklijkebibliotheek
Statistics:
• 10,000+ websites
• 35,000+ harvests
• 16+ terabytes
National Library of the Netherlands: Web archive since 2007

3.1 Data: extraction and processing
• extracting all homepages + all pages 1 level deep
• matching with seedlist
• adding KB metadata
• cleaning, processing, data enrichment (e.g. NER)
• generate aggregations; ~900K XML files
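
The first step, selecting homepages plus pages one level deep, could look roughly as follows. The sketch reads "1 level deep" as "at most one URL path segment below the homepage", which is an assumption: the slides do not say whether depth is measured in path segments or in link hops. The function names and URLs are hypothetical:

```python
from urllib.parse import urlsplit

def path_depth(url: str) -> int:
    """Number of non-empty path segments: 0 for a homepage."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])

def is_homepage_plus_one(url: str) -> bool:
    """Keep homepages and pages at most one path segment below them."""
    return path_depth(url) <= 1

urls = [
    "http://example.nl/",
    "http://example.nl/about",
    "http://example.nl/news/2015/item.html",
]
selected = [u for u in urls if is_homepage_plus_one(u)]
print(selected)
```

A link-based reading would instead need the outlinks of each homepage, which this sketch does not model.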

3.2 Potential Use: Explorations
• Potential for analysis and visualization
• Examples via Dutch Web archive
• I. (aggregated) degree of change: hierarchical
• homepages+1, ssdeep (content text, links, images)
• II. (aggregated) content summaries: temporal
• homepages+1, tf-idf
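
The degree of change in item I can be illustrated by comparing two harvests aspect by aspect. The slides name ssdeep (a fuzzy hash) as the measure; the sketch below swaps in `difflib.SequenceMatcher` from the Python standard library so that it runs without extra packages. It is a stand-in, not the ssdeep algorithm, and the 0-100 scale, the unweighted mean for "overall" and the two sample harvests are all assumptions:

```python
from difflib import SequenceMatcher

def change_score(old: str, new: str) -> float:
    """Return 0 for identical inputs and 100 for completely different ones."""
    return 100.0 * (1.0 - SequenceMatcher(None, old, new).ratio())

def aggregated_change(old: dict, new: dict) -> dict:
    """Score each aspect of two harvests, plus an unweighted mean."""
    scores = {key: change_score(old[key], new[key]) for key in old}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores

# Hypothetical harvests of one site: text, link list and image list.
harvest_2010 = {"content": "Welcome to the museum", "links": "/a /b /c", "images": "logo.png"}
harvest_2011 = {"content": "Welcome to the museum", "links": "/a /b /d", "images": "banner.jpg"}

print(aggregated_change(harvest_2010, harvest_2011))
```

Computing this score for every pair of consecutive harvests gives one series per aspect, which is the shape of the per-site charts that follow.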

3.2.1 Examining aggregated degree of change

[Figure: dimension grid (web sphere, web site, web page, page element) against time, 2010-2015; case: eyefilm.nl]

[Figure: Example: eyefilm.nl (2010-2015). Degree of change per harvest, 20100722 to 20150518, for content, links, images and overall (y-axis 0-120); two redesigns are marked]

[Figure: Example: escherinhetpaleis.nl (2010-2015). Degree of change per harvest, 20090226 to 20150510 (y-axis 0-100)]

[Figure: dimension grid (web sphere, web site, web page, page element) against time, 2010-2015; case: UNESCO classifications]

Change rate (type of site)
[Figure: Changes per UNESCO category (all per-quarter harvests, n=~600, 2009-2015). Average change in content, images, links and combined (y-axis 0-60) for category codes 01-31; labelled categories: Meteorology, Law & government, History, Sports, Agriculture]
Change rate (all sites)
[Figure: Change rate (all per-quarter harvests, 2009-2015), plotted per quarter from 2009Q3 to 2015Q1 (y-axis 0-35)]

3.2.2 Examining aggregated content summaries

3.2.2 Exploring Content Summaries
• Examine textual contents of a website
• for example, nu.nl
• most popular Dutch  
news site (Alexa, 2016)
• daily crawls by KB
• Exploration: different temporal site-level
summarizations
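
A temporal site-level summary can be sketched with tf-idf, the weighting the slides mention for content summaries. The sketch treats all text of a site in one month as a single document and ranks terms that are frequent in that month but rare in the others. The whitespace tokenizer, the function name `tf_idf_summaries` and the sample texts are assumptions, not the authors' pipeline:

```python
import math
from collections import Counter

def tf_idf_summaries(docs: dict, top_n: int = 3) -> dict:
    """Return the top_n highest-scoring terms for each named document."""
    term_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    summaries = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        summaries[name] = ranked[:top_n]
    return summaries

# Hypothetical monthly aggregates: all site text for a month joined together.
months = {
    "2013-01": "news weather snow snow ice news",
    "2013-07": "news weather heat heat festival news",
}
print(tf_idf_summaries(months, top_n=2))
```

Terms that occur in every month receive a weight of zero, so the summaries highlight what distinguishes one period from the others.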

[Figure: monthly content summaries for 2013, one panel per month (Jan’13 to Dec’13)]

[Figures: yearly summaries of named entities (NER): organizations, persons and places; panels labelled 2012 to 2015]

3.2.3 Next: combining approaches

4. Conclusion
Towards Multidimensional Web Archive Access

4.1 Conclusion
• Gap between researchers’ needs and data/access
• Researchers’ needs
• rich access, e.g. different analytical levels, temporal ranges
• Archive access
• mainly access at single page level (URLs and queries)
• Calls for new approaches to provide  
access to aggregated contents
• temporally and hierarchically

4.2 Our approach
• Starting from a selection instead of a query
• Potential support for exploratory stages of (re)search
• Potential support for analysis and comparisons
• Issues: which levels of a website to summarize
• experimental focus on homepages and underlying pages
• deeper layers: additional richness, additional issues
• custom file formats vs standardized formats
• Integration into access interfaces

4.3 Ongoing and Future Work
• Further extending our approach;
integration into WebARTist toolset
• providing new ways to explore material
in the archive (without using queries)
• Creating aggregated representations
of unarchived contents
• see “Lost but Not Forgotten: Finding
Pages on the Unarchived Web” (2015)
[Figure labels: “Corpus Creation”, “Analysis”, “Dissemination”]

References
• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical
Implications. Alexandria, 25(1).
• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.
• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General
Assembly. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt
(April 19, 2015)
• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library &
Information Science Research, 21(2), 247–273.
• Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in
relation to social science and humanities research needs. Journal of the Association for Information Science
and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099
• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.
• Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at
Web Archives as Scholarly Sources conference, Aarhus University, Denmark.
• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Lost but Not Forgotten:
Finding Pages on the Unarchived Web. International Journal on Digital Libraries.
• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search
Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York,
NY, USA: ACM.
• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study
revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.
• Rogers, R. (2013). Digital Methods. MIT Press.

Thanks & Acknowledgements
• The WebART team (’12-’16):  
Jaap Kamps, Richard Rogers,  
Arjen de Vries, Hugo Huurdeman,
Thaer Samar, Anat Ben-David,  
Sanna Kumpulainen
• We gratefully acknowledge the
collaboration with the Dutch Web
Archive of the National Library of the
Netherlands.
• This research was supported by the
Netherlands Organization for Scientific
Research (WebART project, NWO
CATCH # 640.005.001).

Contact: Hugo Huurdeman, huurdeman@uva.nl, @timelessfuture
