Search in Research, Let’s Make it More Complex!
Collaboratively Looking Under the Hood and Its Consequences
Marijn Koolen
Humanities Cluster - Royal Netherlands Academy of Arts and Sciences
CLARIAH Media Studies Summer School
Netherlands Institute for Sound and Vision, 3 July 2018
Overview
1. Search in Research
a. Search as part of research process
b. Search vs. other access methods
2. Search, Retrieval and Ranking
a. Retrieval Systems, Ranking Algorithms and Relevance Models
3. Searching in Digital Collections
a. Understanding (digital) collections and their construction
b. Tool analysis through experimentation
4. Search Strategies and Corpus Building
a. Systematic searching
b. Search strategies and sampling
1. Search in Research
Research Process
● Research Phases
○ Exploration, gathering, analysis, synthesis, presentation
○ Extremely non-linear (affordance of digital realm)
● Search happens throughout research process
○ Search phases: pre-focus, focus, post-focus
○ Use different types of collections and search engines
■ General purpose search engines,
■ Domain- and collection-specific (e.g. GLAMs: galleries, libraries, archives, museums),
■ Personal/private (offline) collections
○ Search strategies:
■ Ad hoc or systematic: berrypicking (Bates 1989), keyword harvesting (Burke 2011), …
■ Important for data and tool criticism
Search Engine as Mediator
● For many online materials, access is limited to the search interface
○ Browsing is guided by available structure
■ Drill down via facets
■ Navigate via metadata fields (if enabled)
○ Without (relevant) structure, direct search is only practical alternative
● Searching as exploration
○ How does search engine provide overview?
■ How big is collection?
■ How is collection structure communicated?
■ What (meta)data is available?
■ How are search characteristics explained?
■ How are search results summarised?
Browsing vs. Keyword Searching
● Browsing takes you past unintended materials:
○ Navigating your way to relevance
○ Impresses on you what else there is (see also Putnam 2016)
● Keyword search tends to focus on relevance
○ Pushes related/nearby materials out of view
○ Collection structure can be exposed through facets to restore overview
● Search and research methodology
○ Impact of digital keyword search needs to be reflected in methodology
○ How do you account for search process in scholarly communication?
■ Method of citation is based on analogue browse/search in archives and libraries
■ Pre-focus to focus: switch between ad hoc and systematic?
■ Non-linearity: exploration never stops, assumptions constantly challenged
Keyword Search and “Confronting the Digital”
'To take a single example of this disconnect between research process and representation, many of us
use and cite eighteenth and nineteenth-century newspapers as simple hard-copy references without
mention of how we navigated to the specific article, page and issue. In doing so, we actively misrepresent
the limitations within which we are working.' (Hitchcock 2013, 12)
'This is not only about being explicit about our use of keyword searching - it is about moving beyond a
traditional form of scholarship to data modelling and to what Franco Moretti calls “distant reading”.'
(Hitchcock, Confronting the Digital, 2013, p. 19).
Information Search and Seeking
● Search takes place in context
○ Part of seeking, and overall information behaviour (Wilson)
○ As information behaviour changes (phases), so does seeking and search behaviour
● Reflection-in-action
○ When and where are choice points?
○ How do search actions relate to strategy and information need?
Digital Tool Criticism
Search and Accountability
● What should scholars account for?
○ Aspects of sources, tools and process
● Digital source criticism
○ How to evaluate digital sources (Fickers 2012)
○ Who made digital source, when, why, what for, how?
● Digital tool criticism
○ How to evaluate impact of digital tools (Koolen et al. 2018)
○ Reflection-in-action, experimentation
● Data Scopes
○ How to communicate research process to others (Hoekstra & Koolen 2018)
○ Discuss process of selection, modelling, normalization, linking, classification
2. Search, Retrieval and Ranking
Anatomy of Retrieval Process
Retrieval - Matching and Similarity
● Matching based on user query
○ Query: free text, controlled facet, example (doc, AV or text)
○ Matching docs returned in certain order (non-matching are not retrieved)
■ How does search engine perform matching (esp. for free text and example)?
■ Potentially many objects match query: does order matter?
● Similarity
○ Degree of matching: some match better than others (notion of similarity)
■ Retrieve most similar documents first (ranking)
○ Similar how? Does interface explain?
● Retrieval and ranking
○ Retrieval: which matching documents are returned to the user as results?
○ Ranking: in which order are the results returned?
Retrieval, Ranking and Relevance
● Retrieval results form a set
○ Can be ordered or unordered (e.g. SQL or SPARQL query)
■ Even unordered sets need to be presented to the user in some order
○ Criteria for ordering: alphabetic, size, recency, popularity (views, likes, citations, links)
■ Ordering re-organizes materials, temporarily disrupts “original” organization
■ Provides different view on materials
● Many systems perform relevance ranking
○ Relevant to whom or what?
■ Query: document similarity scores
■ User: e.g. search history, preferences
■ Situation: user, location, time, device, query, work context (page views, annotations)
■ Other aspects: quality, diversity, controversy, polarity, exploration/exploitation, ...
Algorithmic Interpretation of Relevance
● How does an algorithm understand the notion of relevance?
○ Statistical interpretation:
■ Generally: frequent words carry less signal, look for unexpected stuff
■ Many ways of scoring signal
○ TF-IDF:
■ Term Frequency in document (relevance of term in document)
■ Inverse of Document Frequency in collection (commonness of term across docs)
○ Probabilistic Language Model (PLM):
■ Probability of picking term from document as bag of words (relevance of term in doc)
■ Probability of picking term from collection as bag of words (commonness of term)
○ Many other relevance models, e.g. BM25, DFR, SDM, …
■ Different interpretations of relevance, hence different rankings (see the sketch below)
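To make the difference between these models tangible, here is a minimal, self-contained sketch that ranks a toy collection with TF-IDF and with a language model using Jelinek-Mercer smoothing. The documents, query and smoothing weight are illustrative assumptions, not the configuration of any engine discussed here.

```python
import math
from collections import Counter

# Toy collection (illustrative only): each document as a bag of words.
docs = {
    "d1": "voetbal uitslagen voetbal competitie".split(),
    "d2": "verkiezingen uitslagen kamer".split(),
    "d3": "voetbal interview trainer interview".split(),
}
N = len(docs)
df = Counter()                          # document frequency per term
for words in docs.values():
    df.update(set(words))
coll = Counter()                        # collection-wide term counts
for words in docs.values():
    coll.update(words)
coll_size = sum(coll.values())

def tfidf(query, words):
    """Sum of tf * idf: rewards terms frequent in the document, rare in the collection."""
    tf = Counter(words)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df[t])

def plm(query, words, lam=0.5):
    """Query log-likelihood under a Jelinek-Mercer smoothed language model.
    Assumes every query term occurs somewhere in the collection."""
    tf = Counter(words)
    return sum(math.log(lam * tf[t] / len(words)
                        + (1 - lam) * coll[t] / coll_size) for t in query)

query = ["voetbal", "uitslagen"]
for name, score in [("TF-IDF", tfidf), ("PLM", plm)]:
    print(name, sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
```

Even on three documents the models behave differently: PLM breaks the tie that TF-IDF leaves between d2 and d3, partly through document length. On real collections the rankings can diverge substantially.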
Ranking Issues
● Document length
○ TF-IDF doesn’t model document length, favours longer documents
○ PLM explicitly normalizes on document length, favours shorter documents
○ Upshot: Delpher API returns short documents first for short queries
● Document priors: are all documents equal or not?
○ Can use document prior probability (independent of query; see the sketch below)
○ Can favour documents that are more popular, recent, authoritative, …
○ Can favour documents that are more appropriate for situation (location, time of day, …)
● Problem: how do you know how search engine scores relevance?
○ How much should you know about it?
○ Many GLAM search engines have relatively straightforward relevance models, no doc priors
○ Google uses many hundreds of features for document, query, user and situation
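A minimal sketch of how a query-independent prior can re-order results. The query scores and view counts below are made up; real engines may mix hundreds of such signals, as noted above.

```python
import math

# Hypothetical per-document query scores (e.g. log-likelihoods) and view counts.
query_score = {"d1": -2.1, "d2": -2.3, "d3": -2.2}
views = {"d1": 10, "d2": 5000, "d3": 300}
total_views = sum(views.values())

def score(doc, use_prior):
    # log P(d|q) is proportional to log P(q|d) + log P(d);
    # the prior ignores the query entirely.
    prior = math.log(views[doc] / total_views) if use_prior else 0.0
    return query_score[doc] + prior

for use_prior in (False, True):
    order = sorted(query_score, key=lambda d: score(d, use_prior), reverse=True)
    print("with prior:" if use_prior else "no prior:  ", order)
# no prior:   ['d1', 'd3', 'd2']   with prior: ['d2', 'd3', 'd1']
```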
Relevance in Metadata Records
● Relevance ranking of metadata records
○ Metadata records are peculiar textual representations
■ Minimal amount of text, low redundancy
■ Majority of terms occur only once
○ Which part of TF-IDF contributes more to score of metadata record?
○ Which fields are useful/used for matching?
● NISV collection
○ Search engine indexes metadata records
■ Some records have lengthy itemized descriptions, some do not
■ Some have transcripts, some do not
○ Consequences for retrieving? And for ranking?
■ How does search engine handle this?
■ How does search engine communicate this?
Retrieving and Ranking Audiovisual Materials
● Hard to match keywords against AV signal directly
○ Option: use text representation for AV document
■ E.g. metadata, description, script, speech transcript, ...
○ Option: use AV representation of query
■ E.g. example document or user recording
■ Use audio or visual similarity (again, similar how?)
Opaqueness of Interfaces and Experimentation
● Experiment to understand search functionalities (see the probe sketch below)
○ How can you find out if multiple search terms are treated with Boolean AND or OR operators?
○ How can you find out if terms are stemmed/normalized?
● Phrase search:
○ What happens when you use quotation marks to group terms into a phrase?
○ How do the results compare to those using no quotation marks?
● Proximity search:
○ Can you specify that terms should be near each other?
● Fuzzy search: wildcard and edit distance searches
○ Controlling lexical variation vs. uncontrolled wildcard search
○ voetbal+voetballen vs. voetbal* (matches voetbalvereniging, voetbalveld, ...)
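Such probing can be made systematic. The sketch below infers AND/OR behaviour from hit counts alone; result_count is a hypothetical stand-in for reading the hit count from whatever interface or API you are studying.

```python
# Probe for how an engine treats multi-word queries, based on hit counts.

def result_count(query: str) -> int:
    """Placeholder: replace with a real hit-count lookup against the engine
    under study (numbers below are made up, for illustration)."""
    fake_counts = {"voetbal": 1200, "uitslagen": 800, "voetbal uitslagen": 150}
    return fake_counts.get(query, 0)

def probe_boolean(term_a: str, term_b: str) -> str:
    """AND can never match more than the rarest term;
    OR can never match fewer than the most frequent term."""
    a = result_count(term_a)
    b = result_count(term_b)
    ab = result_count(f"{term_a} {term_b}")
    if ab <= min(a, b):
        return "consistent with AND"
    if ab >= max(a, b):
        return "consistent with OR"
    return "inconclusive (perhaps ranked OR / partial matching)"

print(probe_boolean("voetbal", "uitslagen"))   # -> consistent with AND
```

A stemming probe works the same way: if voetbal and voetballen return identical hit counts, the engine likely reduces both to a single stem.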
Exercise
● Experiment with the Search and Compare tools of the CLARIAH Media Suite
○ Find out if stopwords are removed
○ Find out if words are stemmed/normalized
○ Find out how multi-word queries are interpreted, i.e. as AND or OR
○ Find out how standard search operators work
■ Boolean AND, OR and NOT
■ Quotation marks for phrases
3. Searching in Digital Collections
Nature of Digital Collections
● Collections of GLAMs are often built up over decades
○ Based on aims and selection criteria
■ Rarely "complete", dependent on availability of materials
○ Digital access via digitization, or digital archiving (born-digital)
■ Some things are lost in this process (e.g. context, quality, …)
● Heterogeneity: mix of object/source types (sub-collections)
○ Different modalities, different ways of accessing and presenting
■ Text vs. Image vs. AV vs. 3D (or 4D)
Nature of Metadata
● Digital access via metadata
○ Metadata: data about the object/source
○ Types: formal, structural, technical, administrative, aboutness
○ Metadata fields allow selection and search via specific fields
■ Title, description, creator, creation date, genre, …
○ Allows (seemingly) uniform access to heterogeneous collections
■ But, different materials have different aspects to describe
■ Edition is relevant for books and films, not so much for paintings
● Metadata creation process
○ Often done with limited time, information and system flexibility
○ Inherently subjective, especially content analysis
● Size matters
○ Requirements change as size of collection grows (also depends on expectations)
Archival Structure and NISV Audiovisual Collection
● Hierarchical organization
○ 4 levels
■ Series: De Wereld Draait Door
■ Season: De Wereld Draait Door 2016
■ Program: De Wereld Draait Door 21-06-2016
■ Segment: De Wereld Draait Door 21-06-2016
○ Each level has a metadata record (with overlap in fields, e.g. title)
● Follows archival standard
○ Describe aspect at highest relevant level
○ Don’t repeat at lower levels unless it deviates (e.g. main titles)
○ Fonds: aggregation of documents from same origin
Digital Source and Data Criticism
● Power of the archive
○ Problem of perspective (from archive-as-source to archive-as-subject, Stoler 2002)
● History of the archive
○ Collections created over decades often go through changes in
■ selection criteria, cataloguers (human or algorithm),
■ cataloguing budgets, policies, rules, practice and vocabularies,
■ software (migrations and updates), hardware,
■ institutional mission, societal attitudes, …
○ Most of these aspects remain undocumented or partially documented
● Consequences
○ Almost inherently incomplete, inconsistent and sometimes necessarily incorrect
○ After many years, it's hard to retrace what happened
■ and how it affects access, selection and analysis
Metadata in theory vs. metadata in practice (Source: Jaap Kamps)
Combined Collections
● Several portals combine (heterogeneous) collections
○ Examples:
■ Europeana, European Newspapers, EUscreen, Nederlab, Delpher, Online Archive of California, …
○ Worldwide aggregated collections:
■ ArchiveGrid (1000+ archives): over 5M finding aids
■ WorldCat (72,000 libraries): 400M records, 2.6B assets, 100M persons
● Huge challenge for source criticism as well as search
○ Collections vary in size, provenance, selection criteria, metadata policies, interpretation and richness
○ Heterogeneous metadata schemas have been mapped to single schema
■ Causes problems for interpretation
■ E.g. what does creator mean for paintings, films, tv series, letters, advertisements, ...?
Assessing Metadata Quality
● Questions
○ What are pitfalls in relying on metadata?
○ How can we evaluate metadata quality?
○ What are relevant aspects to consider?
● Collection inspection
○ In CLARIAH Media Suite we created a tool for inspecting metadata
■ Esp. useful for complex collections like NISV audiovisual collection
■ Somewhat ad hoc, please feel encouraged to give feedback!
○ Please go to the Media Suite and open the Collection Inspector tool
■ Click on “select field to analyse” and let the interface load the data on completeness (this will take a while)
Assessing Timelines and Other Visualizations
● Timeline visualizations give view of temporal spread
○ Very difficult to interpret properly
● Issues with absolute frequencies:
○ Collection materials not evenly distributed
○ Need to compare query-specific distribution to collection distribution (see the sketch below)
● Issues with relative frequencies:
○ Incompleteness not evenly distributed (use collection inspector)
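A minimal worked example of the difference, with made-up yearly counts; in practice the numbers would come from the timeline tool and the collection inspector.

```python
from collections import Counter

# Made-up yearly hit counts for a query, and yearly sizes of the whole collection.
query_hits = Counter({1960: 5, 1970: 40, 1980: 90, 1990: 200})
collection_size = Counter({1960: 1_000, 1970: 5_000, 1980: 9_000, 1990: 25_000})

for year in sorted(query_hits):
    absolute = query_hits[year]
    relative = absolute / collection_size[year]  # share of that year's material
    print(year, f"absolute={absolute:4d}", f"relative={relative:.4f}")

# Absolute counts rise steeply towards 1990, but the relative frequency peaks
# in 1980: part of the apparent growth is growth of the collection itself.
```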
Retrievability and Metadata Characteristics
● Different types of metadata fields
○ Controlled vocabulary: e.g. broadcast channel (radio or tv)
○ Number: number of episodes/seasons/segments
○ Time/date: program length, recording date
○ Free keyword/keyphrase: title, person name (tend to be non-unique)
○ Free text: description, summary, transcript, … (tend to be unique)
● Different types allow different forms of retrieval and ranking
○ Long text fields have more terms, with higher frequencies
■ Some types of programs have longer descriptions/transcript
■ These match more queries, so higher chance of being retrieved
■ Impact of long text fields on ranking depends on relevance model!
○ Repeated values allow aggregation, navigation
Metadata and Search Facets
● Some search interfaces offer facets to narrow down search results
○ E.g. broadcaster and genre in the CLARIAH Media Suite
○ Facets provide overview, afford focusing through selection
● How do facets work?
○ Based on metadata fields: rich schema has rich options for facets
○ Types of metadata fields: controlled vocab, number, date, keyword/phrase, free text
■ Facets work for fields with a limited range of values, so not for free text fields
○ Long tails in facets: typically few high-frequency and many low-frequency values (see the sketch below)
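Conceptually, a facet is a value count over a controlled field. A minimal sketch with hypothetical records (the genre field name is an assumption for illustration):

```python
from collections import Counter

records = [
    {"title": "Journaal", "genre": "news"},
    {"title": "DWDD 21-06-2016", "genre": "talk show"},
    {"title": "Journaal", "genre": "news"},
    {"title": "Studio Sport", "genre": "sports"},
    {"title": "Journaal", "genre": "news"},
    {"title": "Andere Tijden", "genre": "documentary"},
]

facet = Counter(r["genre"] for r in records)   # one count per distinct value
for value, count in facet.most_common():
    print(f"{value:12s} {count}")
# Typical long tail: one frequent value ('news'), several values occurring once.
# A free-text field would yield nearly as many distinct values as records,
# which is why it makes a poor facet.
```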
Exercise
● Experiment with the Collection Inspector of the CLARIAH Media Suite
○ Try out the collection inspector:
■ Scroll through the list of fields to get an idea of what is available
■ Look at completeness of fields, for instance “genre”, “keywords” and “awards”
■ Which metadata fields are relatively complete?
■ At which archival levels are they most complete?
● Explore which fields are available and which fields make good facets
○ Explore facet distributions in entire collection and for specific queries
4. Search Strategies and Corpus Building
Searching for Corpus Building
● Importance of selection criteria
○ Do you have to hand pick each document?
○ Or can you select sets based on matching criteria?
○ Is representativeness important? If so, representativeness of what?
○ Or completeness? Why?
● Exploiting facets and dates
○ Filtering: align facets/dates with research focus
○ Sampling: compare across facets
■ Which facet types can you use?
○ Sampling strategies
■ Sample per facet/year (e.g. X items per facet/year)
■ Within facets, select randomly or not (see the sampling sketch below)
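A minimal sketch of the “X items per facet/year” idea, with randomly generated toy records (field names are illustrative). The fixed seeds make both the data and the sample reproducible, which matters if you want to report your corpus construction.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible toy data

# Hypothetical records with a facet value and a year.
records = [
    {"id": i,
     "genre": random.choice(["news", "sports", "talk show"]),
     "year": random.choice([2014, 2015, 2016])}
    for i in range(500)
]

def stratified_sample(records, per_stratum=5, seed=42):
    """Group records by (facet value, year) and draw up to per_stratum per group."""
    rng = random.Random(seed)  # fixed seed makes the sample reportable
    strata = defaultdict(list)
    for r in records:
        strata[(r["genre"], r["year"])].append(r)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

corpus = stratified_sample(records)
print(len(corpus), "records from",
      len({(r["genre"], r["year"]) for r in corpus}), "strata")
```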
Tracking Context in Corpus Building
● Why were certain documents selected?
○ How were they selected?
○ What strategy was used?
○ Documenting helps with understanding/remembering choices
● Do research goals and questions change during collection?
○ Interacting with sources during search updates knowledge structures (Vakkari 2016)
○ Updates tend to be small and incremental, hence barely noticeable
○ Explicit reflection-in-action can bring these to the surface (Koolen et al. 2018)
○ Adding annotations can also provide context
Systematic Searching
● Systematic (comprehensive) search has two factors (Yakel 2010):
○ Search strategy (user)
○ Search functionalities (system)
○ Functionalities shape/affect strategy
● Step 1: systematic search for relevant collections online
○ Different collections/sites offer different search functionalities and levels of detail
○ Explicitly address what consequences this has for your strategy and research goals
● Step 2:
○ Explore individual collections using one or more strategies
○ "Researchers need to be flexible and creative to accommodate the vagaries of cataloging
practices." (Yakel 2010, p. 110)
○ Footnote and reference chasing: references often give an "information scent", suggesting other collections and items to explore.
Search Strategies
● Web search strategies defined by Drabenstott (2001)
○ Discussed in archive context by Yakel (2010)
● Five strategies
○ Synonym generation
○ Chaining
○ Name collection
○ Pearl growing
○ Successive segmentation
● Somewhat related to information seeking patterns by Ellis (1989)
○ Starting, chaining, browsing, differentiating, monitoring, extracting
Drabenstott’s Strategies (1/2)
● Synonym generation: 1) search with a relevant term, 2) close read results to identify related terms (wordclouds, facets), 3) search via related terms for synonyms
● Chaining: follow references/citations (explicit or implicit), identify a relevant subset and use explicit structure to explore connected/related subsets
● Name collection: search with keywords, identify relevant names, search with names, identify related names and keywords, repeat. Similar to keyword harvesting (Burke 2011).
Drabenstott’s Strategies (2/2)
● Pearl growing: start small and focused with specific search terms, slowly expand with additional terms to broader topics/themes
● Successive segmentation: opposite of pearl growing; start broad and increasingly zoom in and focus, e.g. make queries increasingly specific by adding (ANDing) keywords, replacing broad terms with lower-frequency terms, or selecting facets
Search Strategies and Research Phases
● Research phase
○ Exploration <-> search phase pre-focus
i. Ad hoc, no need yet for systematic search
ii. Mostly pearl growing and/or successive segmentation to determine focus
○ Analysis <-> search phase focus
i. Switch to systematic, determine strategy
ii. Use chaining, name collection, synonym generation (for coverage/representation, boundaries)
● But reality resists:
○ (Re)search process is very non-linear
○ Boundary between exploration and analysis is not always clear
○ Late discoveries can prompt or force new directions, ...
When To Stop
● Often switch from exploration to “sorta” systematic search
○ But hard to remember and explain what and how you searched
○ Moreover, difficult to determine when to stop
○ Explicit strategy allows for stopping criteria
● Stopping criteria
○ Check whole set/sample, all available facets, ...
○ Diminishing returns: you increasingly encounter already-seen items, new relevant material becomes rare (see the sketch below)
○ When stopping, make explicit (at least for yourself) when and why you stopped
● Meta-strategy:
○ Change strategy/tactics
○ E.g. successive segmentation -> harvest keywords -> switch segment -> harvest keywords, ...
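One way to make a stopping criterion explicit is to quantify diminishing returns. The sketch below tracks, per batch of results, the share of items not seen before; the threshold and window values are assumptions to tune per study.

```python
# Stop when the share of new items stays below a threshold for several batches.

def should_stop(batches, window=3, threshold=0.15):
    """Return True once novelty is below `threshold`
    for `window` consecutive result batches."""
    seen = set()
    low_novelty_streak = 0
    for batch in batches:
        new = [item for item in batch if item not in seen]
        seen.update(batch)
        novelty = len(new) / len(batch) if batch else 0.0
        low_novelty_streak = low_novelty_streak + 1 if novelty < threshold else 0
        if low_novelty_streak >= window:
            return True
    return False

# Illustrative batches of result ids from successive (re)formulated queries:
batches = [list(range(0, 10)), list(range(5, 15)), list(range(8, 18)),
           list(range(9, 19)), list(range(10, 20)), list(range(10, 20))]
print(should_stop(batches))   # -> True: novelty has dried up
```

Logging the novelty values per batch also gives you the record of when and why you stopped that the slide above asks for.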
Wrap Up
● Search in research
○ How to incorporate these processes in research methodology
● Large, heterogeneous collections introduce issues for research
○ Assessing incompleteness of materials
○ Assessing incompleteness, incorrectness and inconsistency of metadata
● Looking under the hood
○ Evaluating information access functionalities (search and browse)
○ Selecting an appropriate search strategy for research goals
○ Determining success/failure of searches
○ Understanding search for corpus building
References
Burke, T. 2011. How I Talk About Searching, Discovery and Research in Courses. May 9, 2011.
Drabenstott, K.M. 2001. Web Search Strategy Development. Online, 25(4), pp. 18-25.
Fickers, A. 2012. Towards a New Digital Historicism? Doing History in the Age of Abundance. VIEW Journal of European Television History and Culture, Volume 1 (1). http://orbilu.uni.lu/bitstream/10993/7615/1/4-4-1-PB.pdf
Hitchcock, T. 2013. Confronting the Digital - Or How Academic History Writing Lost the Plot. Cultural and Social History, Volume 10, Issue 1, pp. 9-23. https://doi.org/10.2752/147800413X13515292098070
Hoekstra, R., M. Koolen. 2018. Data Scopes for Digital History Research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, Volume 51 (2).
Koolen, M., J. van Gorp, J. van Ossenbruggen. 2018. Lessons Learned from a Digital Tool Criticism Workshop. Digital Humanities in the Benelux 2018 Conference.
Putnam, L. 2016. The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast. American Historical Review, Volume 121, Number 2, pp. 377-402.
Vakkari, P. 2016. Searching as Learning: A Systematization Based on Literature. Journal of Information Science, 42(1), pp. 7-18.
Yakel, E. 2010. Searching and Seeking in the Deep Web: Primary Sources on the Internet. Working in the Archives: Practical Research Methods for Rhetoric and Composition, pp. 102-118.