Towards Research Engines: Supporting Search Stages in Web Archives (2015)
1. WebART project
Web Archive Retrieval Tools
Jaap Kamps, Richard Rogers, Arjen de Vries
Hildelies Balk, René Voorburg
Thaer Samar, Hugo Huurdeman, Sanna Kumpulainen
Flickr: LucViatour
2. Towards Research Engines: Supporting Search Stages in Web Archives
Hugo Huurdeman
University of Amsterdam
huurdeman@uva.nl
webarchiving.nl
Web Archives as Scholarly Sources conference, Aarhus University, 10 June 2015
3. Introduction
• Web archives preserve the fast-changing Web
• By now, they contain petabytes of valuable Web data
• This could be a valuable resource; however, archives have not frequently been used for research
• Several underlying reasons exist. Here, the focus is on potential limitations in access
Flickr: laughingsquid
4. The concept of ‘task-sharing’
• We look at the concept of task-sharing (Beaulieu, 2000)
• i.e. how should we design web archive access systems to better facilitate task-sharing between scholar and system?
• Bottom-up approach: looking at scholars’ use of Web data, and at how current systems support scholars’ needs
[Diagram: scholar, research task, system]
6. 1.1 Study: scholars’ research phases
• Exploratory analysis of scholars’ research tasks (journal papers)
  • scholars using temporal Web data
• Use research phases as a ‘lens’ to analyze these papers
7. 1.1 Background: Research Phases
• Various scholars have defined different stages occurring in research tasks (Bronstein ’07; Chu ’99; Meho & Tibbo ’03)
• Specifically, Brügger (2014) has defined several research phases relevant to web archive research:
1. Corpus creation
2. Analysis
3. Dissemination
8. 1.2 Study: scholars’ research phases
• Method:
  • querying EBSCOhost using the CMMC (Communication & Mass Media Complete) and LISTA (Library, Information Science & Technology Abstracts) databases
  • selecting all journal papers (2007-2015) which contain longitudinal analyses (excluding computer science papers)
9. 1.2 Study: literature corpus overview
• 18 papers (17 distinct first authors)
• Main areas:
• Information Science
• Communication
• New Media
• Political Science
10. 1.2 Study: literature corpus overview
• Observation: the journal papers use various ways of corpus definition, analysis and dissemination
• However, most papers in this literature set did not use Web archives as a data source
• This corresponds to the large gap between the potential community addressed by web archives and the small group actually using them thus far (Dougherty & Meyer, 2014)
11. 1.3.1 Study results: Corpus definition phase
• 1. selecting webpages or websites, e.g. based on authoritative lists (13)
  e.g. a list of insurance companies (Waite and Harrison, 2007)
• 2. querying regular search engines (5)
  e.g. the term ‘informetrics’ (Bar-Ilan, 2009), descriptors of youth movements (Xenos & Bennett, 2007)
• 3. taking a sample of webpages (4)
  e.g. one week per month (Li et al., 2014); to reduce the large size of the corpus, or data bias (John, 2013)
• Often: combination of methods
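The "one week per month" sampling mentioned above can be sketched as follows. This is only an illustration: the capture records, their field names and the choice of "days 1 to 7 of each month" are assumptions, not the procedure used in the cited papers.

```python
from datetime import date

def in_sampled_week(day: date, week_index: int = 0) -> bool:
    """True if `day` falls in the chosen 7-day block of its month (0 = days 1-7)."""
    return (day.day - 1) // 7 == week_index

def sample_captures(captures, week_index=0):
    """Keep only captures whose crawl date lies in the sampled week of each month."""
    return [c for c in captures if in_sampled_week(c["date"], week_index)]

# Hypothetical capture records (URL plus crawl date).
captures = [
    {"url": "http://example.org/", "date": date(2014, 1, 3)},
    {"url": "http://example.org/", "date": date(2014, 1, 20)},
    {"url": "http://example.org/news", "date": date(2014, 2, 7)},
    {"url": "http://example.org/news", "date": date(2014, 2, 8)},
]
sample = sample_captures(captures)
print([c["date"].isoformat() for c in sample])
```

Changing `week_index` selects a different week of each month, which is one simple way to check whether a result depends on the sampled period.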
13. 1.3.1 Support: Corpus definition phase
• Current support:
  • Most: Selecting URLs (Wayback Machine)
  • Many: Querying the contents of the archive
  • Few: Selecting (predefined) categories
  • Very few: Sampling contents of the archive
• Current limitations:
  • Defining, saving & sharing of corpora
  • Document-centric access methods (Hockx-Yu, 2014)
  • Limitations of search (Ben-David & Huurdeman, 2014)
14. 1.3.2 Results: Analysis phase (1/2)
• Content analysis (66.7%)
  • manual coding
  • coding schemes, at times based on existing frameworks
• Content analysis (22.2%)
  • automatic
  • existing or custom-developed tools
• Network analysis (11.1%)
  • Issue Crawler, link classifications
15. 1.3.2 Results: Analysis phase (2/2)
• Level of analysis (based on Brügger, 2013):
  • page element (4) (22%), e.g. mission statements
  • web page (6) (33%), e.g. blog pages
  • web site* (7) (39%), e.g. political actors’ sites
  • web sphere (1) (6%), e.g. youth web sphere
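The percentages for the levels of analysis follow from the paper counts over the 18-paper corpus. A short check, using the count of 6 for web pages (the value consistent with the stated 33%):

```python
# Number of papers per level of analysis, as listed on the slide.
counts = {"page element": 4, "web page": 6, "web site": 7, "web sphere": 1}

total = sum(counts.values())  # size of the literature corpus
shares = {level: round(100 * n / total) for level, n in counts.items()}

for level, pct in shares.items():
    print(f"{level}: {counts[level]} of {total} papers ({pct}%)")
```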
16. 1.3.2 Support: Analysis phase
• Current support:
  • Very few: analysis (n-gram, trends), export options
• Current limitations:
  • Generally not applicable to custom corpora
  • No ways to define granularity of results
  • Often have to resort to script-based analysis tools
  • Lack of integrated content analysis, coding support, ..
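As an illustration of the kind of script-based analysis scholars currently fall back on, here is a minimal sketch of an n-gram trend over a user-defined corpus. The corpus structure (a list of records with `year` and `text`) and the per-year granularity are assumptions for the example, not features of any existing archive interface.

```python
import re
from collections import defaultdict

def ngrams(text, n):
    """Lowercased word n-grams of a text, as space-joined strings."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_trend(corpus, phrase):
    """Relative frequency of `phrase` per year in a custom corpus."""
    n = len(phrase.split())
    hits, totals = defaultdict(int), defaultdict(int)
    for doc in corpus:
        grams = ngrams(doc["text"], n)
        hits[doc["year"]] += grams.count(phrase.lower())
        totals[doc["year"]] += len(grams)
    return {y: (hits[y] / totals[y] if totals[y] else 0.0) for y in sorted(totals)}

# Hypothetical two-document corpus.
corpus = [
    {"year": 2008, "text": "The web archive opens. Web archive access is limited."},
    {"year": 2012, "text": "Researchers search the archive."},
]
trend = ngram_trend(corpus, "web archive")
print(trend)
```

Grouping by `doc["year"]` is the only place where granularity is fixed; swapping in a month or a site key would change the level at which results are reported.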
18. 1.3.3 Support: Dissemination phase
• Current support:
  • some visualization options (n-gram, tag clouds)
• Current limitations:
  • Set of visualizations depends on archive
  • Generally not applicable to user-defined corpora
19. 1.4 Summary
• Observation: omissions in current support for corpus creation, analysis and dissemination in a research context
• Opportunities arise to increase task-sharing in future systems
[Diagram: scholar, research task, system]
21. 2.1 Supporting the flow (1/2)
• How can this varied set of features be combined into one integrated access system?
  • with high usability and without cognitive overload
• Traditional approach: “Complex” interface integrating all functionality
[Diagram: Search, ?]
23. 2.1 Supporting the flow (2/2)
• Our approach: Divide functionality per (research) stage
• Inspired by ongoing work on supporting the flow of Web and book search in multistage interfaces, based on cognitive models of the search process (Huurdeman & Kamps, 2014; Huurdeman, Kamps, Koolen & Kumpulainen, 2015)
[Diagram: Search + Corpus Creation, Search + Visualization, Search + Analysis]
24. 2.2 Current research prototypes: based on the Dutch Web archive
• National Library of the Netherlands (KB)
  • Selective Web archive (2007-now)
  • 10+ terabytes (25,000+ harvests)
• Idea: modular system
25. 2.2.1 Supporting research phases: corpus creation
• faceted search interface
• different modalities to explore results
• possibility to
  • save (complex) queries
  • save results
  • categorize
[Diagram: Search, Corpus Creation, Saved queries]
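The corpus creation features above (saving queries, saving results, categorizing) suggest a small data model. The following is a sketch under assumptions: the class and field names are hypothetical and do not describe the actual prototype.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SavedQuery:
    """A query plus the facet filters active when it was saved (hypothetical)."""
    text: str
    facets: tuple = ()  # e.g. (("year", "2010"), ("domain", "example.nl"))

@dataclass
class Corpus:
    """A user-defined corpus: saved queries and categorized results."""
    name: str
    queries: list = field(default_factory=list)
    results: dict = field(default_factory=dict)  # url -> set of category labels

    def save_query(self, query: SavedQuery) -> None:
        if query not in self.queries:  # avoid storing the same query twice
            self.queries.append(query)

    def save_result(self, url: str, category: str = None) -> None:
        labels = self.results.setdefault(url, set())
        if category:
            labels.add(category)

    def by_category(self, category: str) -> list:
        return sorted(u for u, labels in self.results.items() if category in labels)

corpus = Corpus("elections")
q = SavedQuery("verkiezingen", facets=(("year", "2010"),))
corpus.save_query(q)
corpus.save_query(q)  # duplicate is ignored
corpus.save_result("http://example.nl/partij", "political actor")
corpus.save_result("http://example.nl/nieuws", "news")
print(corpus.by_category("news"))
```

Keeping queries alongside results means a corpus can be re-created or shared by re-running the saved queries, rather than only by exchanging a list of URLs.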
26. 2.2.1 Supporting research phases: corpus creation
• Further customization ‘under the hood’: define search strategy
  • via visual building blocks
  • flexibility in defining a corpus (determine selection, ranking, queries, etc.)
  (De Vries et al., 2010); see also: spinque.com
[Diagram: Search, Corpus Creation]
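The idea of defining a search strategy from building blocks can be illustrated with composable functions. This is a rough sketch only: the block names (`select`, `rank`, `top`) and the document records are assumptions, and it does not reflect the Spinque system or its interface.

```python
def select(predicate):
    """Block that keeps only documents matching `predicate`."""
    return lambda docs: [d for d in docs if predicate(d)]

def rank(score):
    """Block that orders documents by descending `score`."""
    return lambda docs: sorted(docs, key=score, reverse=True)

def top(k):
    """Block that keeps the first `k` documents."""
    return lambda docs: docs[:k]

def strategy(*blocks):
    """Chain blocks into one strategy, applied left to right."""
    def run(docs):
        for block in blocks:
            docs = block(docs)
        return docs
    return run

# Hypothetical archived documents.
docs = [
    {"url": "a", "year": 2009, "text": "election election"},
    {"url": "b", "year": 2011, "text": "election news"},
    {"url": "c", "year": 2012, "text": "election election debate"},
    {"url": "d", "year": 2013, "text": "weather"},
]

recent_election_pages = strategy(
    select(lambda d: d["year"] >= 2010),
    rank(lambda d: d["text"].split().count("election")),
    top(2),
)
print([d["url"] for d in recent_election_pages(docs)])
```

Because each block has the same shape (a list of documents in, a list out), selection and ranking choices can be swapped or reordered without touching the rest of the strategy.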
28. 2.2.3 Supporting research phases: dissemination
• Visualization interface
  • based on RAW (raw.densitydesign.org)
  • visualize datasets (graphs and visualizations)
[Diagram: Search, Dissemination]
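A visualization stage of this kind needs corpus-level results in tabular form. As a sketch, assuming the visualization tool accepts delimiter-separated data, results from a user-defined corpus could be exported as CSV. The column names and figures below are made up for illustration.

```python
import csv
import io

def to_csv(rows, columns):
    """Serialize result rows (dicts) to a CSV string with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative aggregate: pages per category and year in a custom corpus.
rows = [
    {"year": 2008, "category": "news", "pages": 120},
    {"year": 2012, "category": "news", "pages": 340},
]
csv_text = to_csv(rows, ["year", "category", "pages"])
print(csv_text)
```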
29. 2.3 Caveats & discussion
• Looking at access aspects
  • not at underlying data & its properties
  • next step: contextualizing ‘completeness’ of results (see Huurdeman, Kamps, Samar, De Vries, Ben-David & Rogers, 2015)
• Slightly utopian vision: not all analysis can be supported
  • generic versus specific approaches
  • towards ‘toolmaker’s tools’
• Different archives offer different toolsets
  • Importance of sharing (open-source) and collaboration
30. 2.4 Conclusion
• Exploratory analysis of scholars’ choices related to corpus definition, analysis and dissemination
• These choices revealed a number of limitations of current access interfaces
• Therefore, we propose a more fluid approach, moving from mere search to ‘research engines’
[Diagram: Wayback Machine, Search engine, ‘Research’ engine]
33. Thanks & Acknowledgements
• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Thaer Samar, Sanna Kumpulainen; and Anat Ben-David.
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
34. References
• Beaulieu, M. (2000). Interaction in information searching and retrieval. Journal of Documentation, 56(4), 431–439.
• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1).
• Bronstein, J. (2007). The role of the research phase in information seeking behaviour of Jewish scholars: a modification of Ellis’s behavioural characteristics. Retrieved April 20, 2015, from http://www.informationr.net/ir/12-3/paper318.html
• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved April 19, 2015, from http://netpreserve.org/sites/default/files/attachments/Brugger.ppt
• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.
• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.
• Dunne, C., Shneiderman, B., Gove, R., Klavans, J., & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351–2369.
• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.
• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.
• Huurdeman, H., Kamps, J., Koolen, M., & Kumpulainen, S. (forthcoming). The Value of Multistage Interfaces for Book Search. CEUR-WS.
• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.
• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.
• Rogers, R. (2013). Digital Methods. MIT Press.
• de Vries, A., Alink, W., & Cornacchia, R. (2010). Search by Strategy. Proc. ESAIR '10.