Towards Research Engines: Supporting Search Stages in Web Archives (2015)


  1. WebART project: Web Archive Retrieval Tools. Jaap Kamps, Richard Rogers, Arjen de Vries, Hildelies Balk, René Voorburg, Thaer Samar, Hugo Huurdeman, Sanna Kumpulainen. (Image: Flickr, LucViatour)
  2. Hugo Huurdeman, University of Amsterdam, huurdeman@uva.nl. Towards Research Engines: Supporting Search Stages in Web Archives. webarchiving.nl. Web Archives as Scholarly Sources conference, Aarhus University, 10 June 2015
  3. Introduction • Web archives preserve the fast-changing Web • By now they contain petabytes of valuable Web data • This could be a valuable resource; however, archives have not frequently been used for research • Several underlying reasons exist. Here, the focus is on potential limitations in access (Image: Flickr, laughingsquid)
  4. The concept of ‘task-sharing’ • We look at the concept of task-sharing (Beaulieu, 2000) • i.e. how should we design web archive access systems to better facilitate task-sharing between scholar and system? • Bottom-up approach: looking at scholars’ use of Web data, and how current systems support scholars’ needs (Diagram: scholar, research task, system)
  5. 1 Scholars’ use of web data & current support
  6. 1.1 Study: scholars’ research phases • Exploratory analysis of scholars’ research tasks (journal papers) • scholars using temporal Web data • Use research phases as a ‘lens’ to analyze these papers
  7. 1.1 Background: Research Phases • Various scholars have defined different stages occurring in research tasks (Bronstein ’07; Chu ’99; Meho & Tibbo ’03) • Specifically, Brügger (2014) has defined several research phases relevant to web archive research: 1. Corpus creation 2. Analysis 3. Dissemination
  8. 1.2 Study: scholars’ research phases • Method: • querying EBSCOhost using the CMMC (Communication & Mass Media Complete) and LISTA (Library, Information Science & Technology Abstracts) databases • selecting all journal papers (2007-2015) which contain longitudinal analyses (excluding computer science papers)
  9. 1.2 Study: literature corpus overview • 18 papers (17 distinct first authors) • Main areas: • Information Science • Communication • New Media • Political Science
  10. 1.2 Study: literature corpus overview • Observation: various ways of corpus definition, analysis and dissemination in journal papers • However, most papers in this literature set did not use Web archives as a data source • This corresponds to the large gap between the potential community addressed by web archives and the small group actually using them thus far (Dougherty & Meyer, 2014)
  11. 1.3.1 Study results: Corpus definition phase • 1. selecting webpages or websites, e.g. based on authoritative lists (13); e.g. a list of insurance companies (Waite and Harrison, 2007) • 2. querying regular search engines (5); e.g. the term ‘informetrics’ (Bar-Ilan, 2009), descriptors of youth movements (Xenos & Bennett, 2007) • 3. taking a sample of webpages (4); e.g. one week per month (Li et al., 2014), to reduce the large size of a corpus or data bias (John, 2013) • Often: combination of methods
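The "one week per month" sampling mentioned on this slide can be illustrated with a short sketch. This is a hypothetical illustration, not the procedure of the cited papers: the capture dates are invented, and taking the first seven days of each month is an assumption.

```python
from datetime import date, timedelta


def one_week_per_month(capture_dates, first_day=1):
    """Keep only captures inside one 7-day window per month.

    Assumption: the window starts on day `first_day` of every month.
    """
    return [d for d in capture_dates if first_day <= d.day < first_day + 7]


# Hypothetical daily captures covering January to March 2014 (90 days).
start = date(2014, 1, 1)
captures = [start + timedelta(days=i) for i in range(90)]

sample = one_week_per_month(captures)
print(len(captures), "captures reduced to", len(sample))
```

A fixed window keeps the sample evenly spread over time; a randomly placed week per month would be an equally plausible reading of the slide.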
  12. 1.3.1 Study results: Corpus definition phase (Diagram: combinations of the Query, Selection and Sample methods; values shown: 13, 5, 1, 3, 4)
  13. 1.3.1 Support: Corpus definition phase • Current support: • Most: Selecting URLs (Wayback Machine) • Many: Querying the contents of the archive • Few: Selecting (predefined) categories • Very few: Sampling contents of the archive • Current limitations: • Defining, saving & sharing of corpora • Document-centric access methods [Hockx-Yu, 14] • Limitations of search [Ben-David & Huurdeman, 14]
  14. 1.3.2 Results: Analysis phase (1/2) • Content analysis, manual coding (66.7%): coding schemes, at times based on existing frameworks • Content analysis, automatic (22.2%): existing or custom-developed tools • Network analysis (11.1%): issue crawler, link classifications
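As a rough check, the percentages on this slide can be converted back into paper counts. The sketch below assumes that each percentage is a share of the 18 papers in the literature corpus; the slide does not state the denominator, so the resulting counts are an inference rather than reported figures.

```python
# Percentages as given on the slide.
shares = {
    "content analysis (manual coding)": 66.7,
    "content analysis (automatic)": 22.2,
    "network analysis": 11.1,
}

CORPUS_SIZE = 18  # assumption: percentages refer to the 18 papers in the corpus

# Implied number of papers per analysis type, rounded to whole papers.
implied_counts = {name: round(pct / 100 * CORPUS_SIZE) for name, pct in shares.items()}

for name, count in implied_counts.items():
    print(f"{name}: {count} of {CORPUS_SIZE} papers")
```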
  15. 1.3.2 Results: Analysis phase (2/2) • Level of analysis (based on Brügger, 2013): • page element (4) (22%), e.g. mission statements • web page (6) (33%), e.g. blog pages • web site* (7) (39%), e.g. political actors’ sites • web sphere (1) (6%), e.g. youth web sphere (Chart labels: web sphere (1), website (7), page element (4), webpage (8))
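The percentages for the levels of analysis follow directly from the counts in the bullet points. A minimal worked example, using only the bullet-point counts from this slide (the variable names are illustrative):

```python
# Counts per level of analysis, taken from the bullet points on the slide.
levels = {
    "page element": 4,
    "web page": 6,
    "web site": 7,
    "web sphere": 1,
}

total = sum(levels.values())

# Share of each level, rounded to whole percentages as on the slide.
percentages = {name: round(100 * n / total) for name, n in levels.items()}

for name, pct in percentages.items():
    print(f"{name}: {levels[name]} of {total} ({pct}%)")
```

The counts add up to the 18 papers of the literature corpus.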
  16. 1.3.2 Support: Analysis phase • Current support: • Very few: analysis (n-gram, trends), export options • Current limitations: • Generally not applicable to custom corpora • No ways to define granularity of results • Often have to resort to script-based analysis tools • Lack of integrated content analysis, coding support, etc.
  17. 1.3.3 Results: Dissemination phase • Tables (16) • Graphs (10) • Link networks (1) • Model (1)
  18. 1.3.3 Support: Dissemination phase • Current limitations: • Set of visualizations depends on archive • Generally not applicable to user-defined corpora • Current support: • Some visualization options (n-gram, tag clouds)
  19. 1.4 Summary • Observation: omissions in current support for corpus creation, analysis and dissemination in a research context • Opportunities arise to increase task-sharing in future systems (Diagram: scholar, research task, system)
  20. 2 From Search to Research engines
  21. 2.1 Supporting the flow (1/2) • How to combine this varied set of features into an integrated access system? • with high usability and without cognitive overload • Traditional approach: “Complex” interface integrating all functionality
  22. (Image: Dunne et al., 2012)
  23. 2.1 Supporting the flow (2/2) • Our approach: divide functionality per (research) stage • Inspired by ongoing work on supporting the flow of Web and book search in multistage interfaces, based on cognitive models of the search process [Huurdeman & Kamps, 2014; Huurdeman, Kamps, Koolen & Kumpulainen, 2015] (Diagram: Search + Corpus Creation, Search + Visualization, Search + Analysis)
  24. 2.2 Current research prototypes: based on the Dutch Web archive • National Library of the Netherlands (KB) • Selective Web archive (2007-now) • 10+ terabytes (25,000+ harvests) • Idea: modular system
  25. 2.2.1 Supporting research phases: corpus creation • faceted search interface • different modalities to explore results • possibility to • save (complex) queries • save results • categorize (Screenshot labels: Search, Corpus Creation, Saved queries)
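The corpus-creation features listed on this slide (saving queries, saving results, categorizing) could be backed by a small data structure along the following lines. This is a hypothetical sketch, not the prototype's actual design: the class names, fields and example URLs are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SavedQuery:
    """A stored search: free-text query plus selected facet values (hypothetical)."""
    text: str
    facets: tuple = ()  # e.g. (("year", "2010"),)


@dataclass
class Corpus:
    """A user-defined corpus: its saved queries and its categorized results."""
    name: str
    queries: list = field(default_factory=list)
    results: dict = field(default_factory=dict)  # URL -> category label

    def save_query(self, query):
        self.queries.append(query)

    def save_result(self, url, category="uncategorized"):
        self.results[url] = category

    def by_category(self, category):
        return sorted(u for u, c in self.results.items() if c == category)


# Usage with invented example data.
corpus = Corpus("elections-2010")
corpus.save_query(SavedQuery("verkiezingen", facets=(("year", "2010"),)))
corpus.save_result("http://example.org/party-a", category="political party")
corpus.save_result("http://example.org/news-item")
print(corpus.by_category("political party"))
```

Keeping the queries alongside the results makes a corpus definition repeatable and shareable, which addresses the "defining, saving & sharing of corpora" limitation noted earlier in the deck.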
  26. 2.2.1 Supporting research phases: corpus creation • Further customization ‘under the hood’: define search strategy • via visual building blocks • flexibility in defining a corpus (determine selection, ranking, queries, etc.) [De Vries et al., 2010]; see also: spinque.com
  27. 2.2.2 Supporting research phases: analysis • Analysis interface • edit/annotate dataset • search & browse dataset • analyze
  28. 2.2.3 Supporting research phases: dissemination • Visualization interface • based on RAW (raw.densitydesign.org) • visualize datasets (graphs and visualizations)
  29. 2.3 Caveats & discussion • Looking at access aspects • not at underlying data & its properties • next step: contextualizing ‘completeness’ of results [see Huurdeman, Kamps, Samar, De Vries, Ben-David & Rogers, 2015] • Slightly utopian vision: not all analysis can be supported • generic versus specific approaches • towards ‘toolmaker’s tools’ • Different archives offer different toolsets • Importance of sharing (open-source) and collaboration
  30. 2.4 Conclusion • Exploratory analysis of scholars’ choices related to corpus definition, analysis and dissemination • These choices revealed a number of limitations of current access interfaces • Therefore, we propose a more fluid approach, moving from mere search to ‘research engines’ (Diagram: Wayback Machine, Search engine, ‘Research’ engine)
  31. webarchiving.nl @webart12
  32. Thanks & Acknowledgements • The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Thaer Samar, Sanna Kumpulainen; and Anat Ben-David. • We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. • This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
  33. References
 • Beaulieu, M. (2000). Interaction in information searching and retrieval. Journal of Documentation, 56(4), 431–439.
 • Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1).
 • Bronstein, J. (2007). The role of the research phase in information seeking behaviour of Jewish scholars: a modification of Ellis’s behavioural characteristics. Retrieved April 20, 2015, from http://www.informationr.net/ir/12-3/paper318.html
 • Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved April 19, 2015, from http://netpreserve.org/sites/default/files/attachments/Brugger.ppt
 • Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.
 • Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.
 • Dunne, C., Shneiderman, B., Gove, R., Klavans, J., & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351–2369.
 • Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.
 • Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.
 • Huurdeman, H., Kamps, J., Koolen, M., & Kumpulainen, S. (forthcoming). The Value of Multistage Interfaces for Book Search. CEUR-WS.
 • Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.
 • Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.
 • Rogers, R. (2013). Digital Methods. MIT Press.
 • de Vries, A., Alink, W., & Cornacchia, R. (2010). Search by Strategy. Proc. ESAIR '10.
