Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism (Marijn Koolen)
The use of research tools in digital humanities requires critical reflection by the researcher, but also by developers of tools and research infrastructure.
Influence of Timeline and Named-entity Components on User Engagement (Roi Blanco)
Successful applications today are those with features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This contrasts with previous studies, where the importance of these components was examined from a retrieval-effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were added. Further, we investigate whether we can predict user-centred metrics from users' interactions with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system. These findings could inform systems that offer a more personalised user experience, tailored to each user's preferences.
Matrix Queries and Matrix Data Representations in NVivo 11 Plus (Shalin Hai-Jew)
This slideshow, "Matrix Queries and Matrix Data Representations in NVivo 11 Plus," covers the following points:
Matrices and their basic structures
Types of elements (variables) for matrix comparisons
Setting up matrix queries in NVivo 11
Specific matrix “use cases” in qualitative and mixed methods research
Wrap-up
Writing and Publishing about Applied Technologies in Tech Journals and Books (Shalin Hai-Jew)
This slideshow provides insights on how to write and publish about applied technologies in tech journals and books, including the following:
Getting started in tech publishing
Cost-benefit calculations
Parts to an article; parts to a chapter
Writing process
Collaborating
Publishing process
Acquiring readers (and citations)
Post-publishing
Next works
Using “Distant Reading” to Explore Discussion Threads in Online Courses (Shalin Hai-Jew)
In this age of mass data, “distant reading” has come to the fore as a way to deal with large amounts of text data—including text from student discussion threads in online courses. Kansas State University has a site license for NVivo 11 Plus, software that enables multimedia data curation and qualitative and mixed methods data analysis. Two new features in NVivo—sentiment analysis and theme extraction (topic modeling)—enable users to “distant read” large amounts of text to extract some early insights.
What are the expressed sentiments of learners when discussing a particular issue? Do these trend positive or negative?
What topics or themes or concepts are brought up by students given a certain discussion thread prompt?
What do the sentiment and topic insights suggest about where students are at with a particular issue? Are there latent (hidden) insights?
These new features, in combination with text frequency counts (with related text clustering), text searches, and other text data query capabilities (and related data visualization capabilities) in NVivo, enable distant reading for use in online courses. This digital slideshow will introduce NVivo 11 Plus (a local software tool with both Windows and Mac platform versions) and walk through how it may be applied to textual data extracted from an online course.
Understanding what students are thinking is a critical part of transformational teaching and learning. Using computational means to listen and to hear is important to this end.
"Mass Surveillance" through Distant ReadingShalin Hai-Jew
Distant reading refers to the use of computers to “read” texts by counting words, identifying themes and subthemes (through topic modeling), extracting sentiment, applying psychological analysis to the author(s), and otherwise finding latent or hidden insights. This work draws on research into “mass surveillance” across five text sets: academic, mainstream journalism, microblogging, Wikipedia articles, and leaked government data. The purpose was to capture, in an indirect way, some insights about the collective social discussions occurring around this issue. This presentation uses a variety of data visualizations (article network graphs, word trees, dendrograms, treemaps, cluster diagrams, line graphs, bar charts, pie charts, and others) to show how machines read and the types of summary data they enable (at computational speeds, at machine scale, and in a reproducible way). Also, some computational linguistic analysis tools enable the creation of custom dictionaries for unique types of applied research. The tools used in this presentation include NVivo 11 Plus and LIWC2015.
Text REtrieval Conference (TREC) Dynamic Domain Track 2015 (Grace Hui Yang)
This is the introductory talk for the TREC Dynamic Domain Track. The Track ran from 2015 to 2017, aiming to evaluate and advance research in dynamic search and domain-specific search. This talk was prepared to introduce the ideas and setups in the upcoming Track to the research community.
This webinar will share the design of the BISG’s new Educational Taxonomy along with ideas on how Canadian publishers can use the terms to aid discovery of appropriate educational materials and resources online. It will also show the importance of related metadata elements for educational titles.
BISG’s new Educational Taxonomy lists the key terms that describe current and emerging educational standards, concepts, learning objectives, and methodologies used in the classroom, such as Common Core State Standards, Next Generation Science Standards, and more. It was developed by the BISG Educational Standards Taxonomy Working Group and incorporates feedback from potential users and from educators and librarians on their search habits. It is available for free download.
About the host:
Patricia Payton, Senior Manager of Publisher Relations and Content Development for Bowker, a ProQuest Affiliate, is responsible for communicating book and journal metadata requirements and best practices to publishers of all sizes. Patricia has experience in retail bookstores as well as international markets. She also holds a Master’s degree in Library Information Science specializing in Digital Libraries, as well as an MBA. She actively contributes to BISG, AAP, and other industry committees. You can find her on Twitter @Metadata24X7.
This course provides a detailed executive-level review of contemporary topics in graph modeling theory with specific focus on Deep Learning theoretical concepts and practical applications. The ideal student is a technology professional with a basic working knowledge of statistical methods.
Upon completion of this review, the student should acquire improved ability to discriminate, differentiate and conceptualize appropriate implementations of application-specific (‘traditional’ or ‘rule-based’) methods versus deep learning methods of statistical analyses and data modeling. Additionally, the student should acquire improved general understanding of graph models as deep learning concepts with specific focus on state-of-the-art awareness of deep learning applications within the fields of character recognition, natural language processing and computer vision. Optionally, the provided code base will inform the interested student regarding basic implementation of these models in Keras using Python (targeting TensorFlow, Theano or Microsoft Cognitive Toolkit).
Link to course:
https://www.experfy.com/training/courses/graph-models-for-deep-learning
Capitalizing on Machine Reading to Engage Bigger Data (Shalin Hai-Jew)
What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and retweets based on a particular hashtag (#) or keyword search, or of an especially lively Tweetstream from a particular social media account? People are dealing with ever more prodigious amounts of information—from a number of sources. Those who are savvy about using computers to aid their reading (through “distant reading” or “not-reading”) may find that they are able to cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading to aid academic and research work.
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10 (Shalin Hai-Jew)
An experimental feature in NVivo 10 (circa 2013), Autocoding by Existing Pattern, enables the application of semi-supervised machine learning to ingested research data. This results in the extraction of themes and other relevant insights from the data—at machine speeds, based on the classification algorithm. This presentation will introduce this feature in NVivo 10 (on both Windows and Mac platforms). It will show how the machine can achieve high inter-rater reliability (a Cohen’s kappa of one in many cases) on the one hand but still not achieve full human sensibility from “close reading” coding on the other. The presentation will suggest a complementary balance between machine and human coding of qualitative and mixed methods data for the most efficient application of researcher time and expertise.
Spatial Decision Support Portal - Presented at AAG 2010 (Nathan Strout)
A presentation prepared for the American Association of Geographers (AAG) 2010 Annual Meeting in Washington, DC. The presentation discusses work done by the University of Redlands and the SDS Consortium to organize and provide access to the body of knowledge regarding Spatial Decision Support.
The Humanities Cluster invests a lot of effort in developing infrastructure and tools for digital research. As scholars we want those tools to be easy to use and don't want to bother with many of the technical details. But their ease of use often makes it hard to check whether there is a devil in those details whom we should want to meet. Digital tools can do a lot of work for us, but only because they are built on a lot of assumptions. Which of these assumptions are important to consider in research? And how can we develop infrastructure and tools that wear their assumptions on their sleeves and invite us to reflect on their impact? In this talk I will present our research in attempting to address these questions. We have developed conceptual frameworks and techniques for digital tool criticism and evaluation, and for thinking and communicating about digital data processes in research. I will discuss the lessons we have learned from bringing these frameworks and techniques into practice, and how we can incorporate these lessons in digital humanities research methodology and in developing digital infrastructure.
A hands-on approach to digital tool criticism: Tools for (self-)reflection (Marijn Koolen)
Digital tool criticism is a recent and important discussion in Digital Humanities research. We define digital tool criticism as reflection on the role of digital tools in the research methodology and evaluation of the suitability of a given digital tool for a specific research goal. The aim is to understand the impact of any limitation of the tool on the specific goal, not to improve the tool’s performance. That is, it is about ensuring that, as a scholar, you are aware of the impact of a tool on research design, methods, interpretations and outcomes. Our goal in developing digital tool criticism as a method is to help scholars better understand how research methods, tools and activities shape our interpretations. Based on our experiences with two hands-on workshops on digital tool criticism, we find that reflection on the use of digital tools and data in all phases of the research process is key.
Reflection urges scholars to consider digital data and tools as part of the overall research goals and design, and as interdependent with other elements of research design, namely research questions and methods. As scholars go through their research process, assumptions about the research design and the connection between tools, data and questions are constantly challenged, forcing updates to the design and to the interpretation of data and questions.
Data Scopes - Towards transparent data research in digital humanities (Digita... (Marijn Koolen)
Data scopes describe the process of data gathering, cleaning and combining in digital humanities research, which is too often considered mere preparation that is not part of research and is mostly not described in scholarly communications. We argue that scholars need to be more aware of the intellectual effort of this process and make it more transparent.
Managing Ireland's Research Data - 3 Research Methods (Rebecca Grant)
Slides providing an overview of the research methods used in the author's thesis, "Managing Ireland's Research Data: Recognising Roles for Recordkeepers". The methods discussed are online surveys, comparative case studies, and autoethnography.
Licensed as CC-BY.
This presentation was provided by Starr Hoffman, Director of Planning & Assessment, University of Nevada, Las Vegas, during the NISO event, NISO Training Series: Assessment Practices and Metrics for the 21st Century, held on Friday, October 26, 2018.
Search & Recommendation: Birds of a Feather?Toine Bogers
In just a little over half a century, the field of information retrieval has experienced spectacular growth and success, with IR applications such as search engines becoming a billion-dollar industry in the past decades. Recommender systems have seen an even more meteoric rise to success with wide-scale application by companies like Amazon, Facebook, and Netflix. But are search and recommendation really two different fields of research that address different problems with different sets of algorithms in papers published at distinct conferences?
In my talk, I want to argue that search and recommendation are more similar than they have been treated in the past decade. By looking more closely at the tasks and problems that search and recommendation try to solve, at the algorithms used to solve these problems and at the way their performance is evaluated, I want to show that there is no clear black and white division between the two. Instead, search and recommendation are part of a much more fluid continuum of methods and techniques for information access.
(Keynote at "Mind The Gap '14" workshop at the iConference 2014 in Berlin, Germany)
"Mass Surveillance" through Distant ReadingShalin Hai-Jew
Distant reading refers to the uses of computers to “read” texts by counting words, identifying themes and subthemes (through topic modeling), extracting sentiment, applying psychological analysis to the author(s), and otherwise finding latent or hidden insights. This work is based on research on “mass surveillance” based on five text sets: academic, mainstream journalism, microblogging, Wikipedia articles, and leaked government data. The purpose was to capture some insights about the collective social discussions occurring around this issue in an indirect way. This presentation uses a variety of data visualizations (article network graphs, word trees, dendrograms, treemaps, cluster diagrams, line graphs, bar charts, pie charts, and others) to show how machines read and the types of summary data they enable (at computational speeds, at machine scale, and in a reproducible way). Also, some computational linguistic analysis tools enable the creation of custom dictionaries for unique types of applied research. The tools used in this presentation include NVivo 11 Plus and LIWC2015.
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Grace Hui Yang
This is the introductory talk for the TREC Dynamic Domain Track. The Track ran from 2015 to 2017, aiming to evaluate and advance research in dynamic search and domain-specific search. This talk was prepared to introduce the ideas and setups in the upcoming Track to the research community.
This webinar will share the design of the BISG’s new Educational Taxonomy along with ideas on how Canadian publishers can use the terms to aid discovery of appropriate educational materials and resources online. It will also show the importance of related metadata elements for educational titles.
BISG’s new Educational Taxonomy lists the key terms that describe current and emerging educational standards, concepts, learning objectives, and methodologies used in the classroom, such as Common Core State Standards, Next Generation Science Standards, and more. It was developed by the BISG Educational Standards Taxonomy Working Group and incorporates feedback from potential users and from educators and librarians on their search habits. It is available for free download here.
About the host:
Patricia Payton, Senior Manager of Publisher Relations and Content Development for Bowker, a ProQuest Affiliate, is responsible communicating book and journal metadata requirements and best practices to publishers of all sizes. Patricia has experience in retail bookstores as well as international markets. She also holds a Master’s degree in Library Information Science specializing in Digital Libraries as well as an MBA. She actively contributes to BISG, AAP, and other industry committees. You can find her on twitter @Metadata24X7.
This course provides a detailed executive-level review of contemporary topics in graph modeling theory with specific focus on Deep Learning theoretical concepts and practical applications. The ideal student is a technology professional with a basic working knowledge of statistical methods.
Upon completion of this review, the student should acquire improved ability to discriminate, differentiate and conceptualize appropriate implementations of application-specific (‘traditional’ or ‘rule-based’) methods versus deep learning methods of statistical analyses and data modeling. Additionally, the student should acquire improved general understanding of graph models as deep learning concepts with specific focus on state-of-the-art awareness of deep learning applications within the fields of character recognition, natural language processing and computer vision. Optionally, the provided code base will inform the interested student regarding basic implementation of these models in Keras using Python (targeting TensorFlow, Theano or Microsoft Cognitive Toolkit).
Link to course:
https://www.experfy.com/training/courses/graph-models-for-deep-learning
Capitalizing on Machine Reading to Engage Bigger DataShalin Hai-Jew
What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and RT based on a particular hashtag (#) or keyword search or an especially lively Tweetstream based on a particular social media account? People are dealing with ever more prodigious amounts of information—from a number of sources. Those who are savvy to the uses of computers to aid their reading (through “distant reading” or “not-reading”) may find that they are able to cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading to aid in academic and research work.
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10Shalin Hai-Jew
An experimental feature in NVivo 10 (circa 2013), Autocoding by Existing Pattern, enables the application of semi-supervised machine learning to ingested research data. This results in the extraction of themes and other relevant insights from data—at machine speeds, based on the classification algorithm. This presentation will introduce this feature in NVivo 10 (on both Windows and Mac platforms). This will show how the machine can achieve high inter-rater reliability (a Cohen’s Kappa of one in many cases) on the one hand but still not achieve full human sensibility from “close reading” coding on the other. This presentation will suggest a complementary balance between machine- and human- coding of qualitative and mixed methods data for the most efficient application of researcher time and expertise.
Spatial Decision Support Portal- Presented at AAG 2010Nathan Strout
A presentation prepared for the American Association of Geographers (AAG) 2010 Annual Meeting in Washington DC. The presentation discusses work done by the University of Redlands and the SDS Consortium to organize and provide access to the body of knowledge regarding Spatial Decision Support
The Humanities Cluster invests a lot of effort in developing infrastructure and tools for digital research. As scholars we want those tools to be easy to use and don't want to bother with many of the technical details. But their ease of use often makes it hard to check if there is a devil in those details who we should want to meet. Digital tools can do a lot of work for us, but only because they are based on a lot of assumptions. Which of these assumptions are important to consider in research? And how can we develop infrastructure and tools that wear their assumptions on their sleeves and that invite us to reflect on their impact? In this talk I will present our research in attempting to address these questions. We have developed conceptual frameworks and techniques for digital tool criticism and evaluation and for thinking and communicating about digital data processes in research. I will discuss the lessons we have learned from bringing these frameworks and techniques into practice and how we can incorporate these lessons in digital humanities research methodology and in developing digital infrastructure.
A hands-on approach to digital tool criticism: Tools for (self-)reflectionMarijn Koolen
Digital tool criticism is a recent and important discussion in Digital Humanities research. We define digital tool criticism as the reflection on the role of digital tools in the research methodology and the evaluation of the suitability of a given digital tool for a specific research goal. The aim is to understand the impact of any limitation of the tool on the specific goal, not to improve a tool’s performance. That is, ensuring as a scholar to be aware of the impact of a tool on research design, methods, interpretations and outcomes. Our goal with developing digital tool criticism as a method is to help scholars better understand how research methods, tools and activities shape our interpretations. Based on our experiences with two hands-on workshops on digital tool criticism, we find that reflection on using digital tools and data in all phases of the research process is key.
Reflection urges scholars to consider digital data and tools as part of the overall research goals and design, and interdependent with other elements of research design, namely research questions and methods. As scholars go through their research process, assumptions on the research design and the connection between tools, data and questions are constantly challenged, forcing updates in the design and the interpretation of data and question.
Data Scopes - Towards transparent data research in digital humanities (Digita...Marijn Koolen
Data scopes describe the process of data gathering, cleaning and combining in digital humanities research, which is too often considered as mere preparation that is not part of research, and is mostly not described in scholarly communications. We argue that scholars need to be more aware of the intellectual effort of this process and make it more transparent
Managing Ireland's Research Data - 3 Research MethodsRebecca Grant
Slides providing an overview of the research methods used in the author's thesis, "Managing Ireland's Research Data: Recognising Roles for Recordkeepers". The methods discussed are online surveys, comparative case studies, and autoethnography.
Licensed as CC-BY.
This presentation was provided by Starr Hoffman of Director, Planning & Assessment, University of Nevada – Las Vegas during the NISO event, NISO Training Series: Assessment Practices and Metrics for the 21st Century, held on Friday, October 26, 2018.
Search & Recommendation: Birds of a Feather?Toine Bogers
In just a little over half a century, the field of information retrieval has experienced spectacular growth and success, with IR applications such as search engines becoming a billion-dollar industry in the past decades. Recommender systems have seen an even more meteoric rise to success with wide-scale application by companies like Amazon, Facebook, and Netflix. But are search and recommendation really two different fields of research that address different problems with different sets of algorithms in papers published at distinct conferences?
In my talk, I want to argue that search and recommendation are more similar than they have been treated in the past decade. By looking more closely at the tasks and problems that search and recommendation try to solve, at the algorithms used to solve these problems and at the way their performance is evaluated, I want to show that there is no clear black and white division between the two. Instead, search and recommendation are part of a much more fluid continuum of methods and techniques for information access.
(Keynote at "Mind The Gap '14" workshop at the iConference 2014 in Berlin, Germany)
Search, Report, Wherever You Are: A Novel Approach to Assessing User Satisfac... (Rachel Vacek)
In an effort to assess user experience and satisfaction with searching the University of Michigan Library catalog, we developed an online data collection tool that captured both data on user searches and their reports on various aspects of the search experience. We successfully piloted the tool, demonstrating both the usefulness of the assessment data and the readiness of the tool for use with a larger group of campus stakeholders. We focus in this paper on the features and deployment of the data collection tool, and we also discuss our pilot phase findings and our plan to use the tool in future assessment work.
A Research Plan to Study Impact of a Collaborative Web Search Tool on Novice'... (Karthikeyan Umapathy)
In the past decade, research efforts dedicated to studying the process of collaborative web search have been on the rise. Yet a limited number of studies have examined the impact of the collaborative information search process on novices’ query behaviors. Studying and analyzing the factors that influence web search behaviors, specifically users’ query patterns when using collaborative search systems, can help with making query suggestions for group users. Improvements in user query behaviors and system query suggestions help reduce search time and increase query success rates for novices. In this paper, we present an empirical study plan designed to investigate the influence of collaboration between experts and novices, as well as the use of a collaborative web search tool, on novices’ query behavior. In this research-in-progress study, we intend to use SearchTeam as our collaborative search tool. The results of this study are expected to provide information that could help collaborative web search tool designers find ways to improve the query suggestion feature for group users. Additionally, this study will test the hypothesis that having domain experts work with non-experts in collaborative search systems would greatly increase query success rates for non-expert users and help them learn querying strategies over time. If this hypothesis is supported, the use of collaborative web search tools during the training of interns would be highly recommended.
Using Qualitative Methods for Library Evaluation: An Interactive Workshop (OCLC)
Connaway, Lynn Silipigni, and Marie L. Radford. 2016. "Using Qualitative Methods for Library Evaluation: An Interactive Workshop." Presented at the Libraries in the Digital Age (LIDA) Conference, Zadar, Croatia, June 14.
Using Qualitative Methods for Library Evaluation: An Interactive Workshop (Lynn Connaway)
Connaway, Lynn Silipigni, and Marie L. Radford. 2016. "Using Qualitative Methods for Library Evaluation: An Interactive Workshop." Presented at the Libraries in the Digital Age (LIDA) Conference, Zadar, Croatia, June 14.
This presentation was provided by Amanda Wheatley and Sandy Hervieux of McGill University during the NISO webinar "Discovery and Online Search, Part Two: Personalized Content, Personal Data," which was held on June 19, 2019.
News recommenders have the potential to help users filter the enormous amount of news that is available online, and as such may play an important role in determining what information users do and do not get to see. However, current approaches to evaluating recommender systems are often focused on measuring an increase in user clicks and short-term engagement, rather than measuring the user's and society’s longer term interest in diverse and important recommendations. In this talk we aim to bridge the gap between so-called normative notions of news diversity, as it is known in social sciences and specifically democratic theory, and quantitative metrics necessary for evaluating the recommender system. We discuss a number of democratic missions a recommender system could have, together with a set of evaluation metrics stemming from these missions, and suggest ways for practical implementations of these metrics.
The talk will be about practical considerations that our team has had to make in order to bring a recommender system into production. I’ll cover the “default” tools with which we started (Batch processing in Spark) and follow that up with more recent tools like AWS Lambda and Spark Streaming.
Narrative-Driven Recommendation for Casual Leisure Needs (Marijn Koolen)
Many information needs for leisure (books, films, games, music) are highly complex and cover many different relevance aspects. This is an investigation into the nature of human-directed, natural language statements of casual leisure needs across four domains, and a discussion of their implications for conversational search and recommendation systems.
Digital History - Maritieme Carrieres bij de VOC (Marijn Koolen)
Digital History lecture about modelling the maritime careers of sailors at the Dutch East India Company and the challenges of gathering, selecting, modelling, normalizing and classifying historical data.
Facilitating reusable third-party annotations in the digital edition (Marijn Koolen)
We argue the need to support annotations on an edition made by researchers unaffiliated with the edition project, as a contribution to the explanatory material already present on the site, for purposes of private study or for publication in conjunction with a scholarly article. We demonstrate our annotation approach, which exploits RDFa for embedding the edition-specific semantics and identifiers in the edition's HTML pages. We discuss an FRBRoo-based ontology of the editorial domain, capable of describing both the objects of editing (Text and Document) and their representation in the edition. We have a fully functional and open source prototype of an annotation tool that will be actively developed over the coming years, for use in multiple disciplines.
Narrative-Driven Recommendation for Casual Leisure Needs (Marijn Koolen)
Recommender systems typically generate recommendations for a user based on their profile, or for an item given its user interactions. But there are many scenarios, especially in leisure domains such as books, movies, games and music, where users have specific recommendation needs and want to steer the recommendation process towards certain aspects they find relevant. Currently, few recommender or search systems can deal with the complexity of such directed needs, nor do we know well which data types (metadata, user ratings and reviews, item content) are useful to match against different aspects of recommendation needs. There are many discussion forums where users describe their needs and their frustration with current search and recommender systems. In this talk I will summarize our work on analyzing relevance aspects for these needs and describe experiments on dealing with them.
Scholarly Web Annotation - HuC Live 2018 (Marijn Koolen)
Web annotation has a lot of potential for scholarly research but current tools have several big limitations. At the Humanities Cluster of the Royal Netherlands Academy of Arts and Sciences we are developing a scholarly web annotation tool that allows different types of fine-grained annotations on objects of any media type and combining manual and algorithmic annotations.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the input graph must have no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions and is expected to be a non-issue when the computation is performed on massive graphs.
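To make the levelwise idea concrete, here is a minimal Python sketch (my own illustration, not the report's implementation). It uses networkx for the strongly-connected-component decomposition; the function name levelwise_pagerank and the parameter defaults are my assumptions, and the graph is assumed to satisfy the no-dead-ends precondition noted above.

import networkx as nx

def levelwise_pagerank(G, d=0.85, tol=1e-10, max_iter=100):
    # Sketch of levelwise PageRank: rank each strongly connected component (SCC)
    # in topological order, one level at a time, assuming G has no dead ends.
    n = G.number_of_nodes()
    rank = {v: 1.0 / n for v in G}
    cond = nx.condensation(G)  # DAG whose nodes are the SCCs of G
    for scc_id in nx.topological_sort(cond):
        members = cond.nodes[scc_id]["members"]
        # Contributions into this component from upstream components are already final.
        external = {v: sum(d * rank[u] / G.out_degree(u)
                           for u in G.predecessors(v) if u not in members)
                    for v in members}
        for _ in range(max_iter):
            new = {v: (1 - d) / n + external[v]
                      + sum(d * rank[u] / G.out_degree(u)
                            for u in G.predecessors(v) if u in members)
                   for v in members}
            err = sum(abs(new[v] - rank[v]) for v in members)
            rank.update(new)
            if err < tol:
                break
    return rank

For example, levelwise_pagerank(nx.DiGraph([(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)])) finalises the ranks of the component {0, 1} before ranking the downstream component {2, 3} that it links into.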
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; and (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD): SHORT...
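To illustrate one of the optimizations listed in the PageRank description above, skipping computation on vertices that have already converged, here is a minimal Python sketch. It is my own illustration, not the STICD code; the function name and the adjacency-dict input format are assumptions, and the graph is assumed to have no dangling (dead-end) vertices.

def pagerank_skip_converged(out_links, d=0.85, tol=1e-12, max_iter=100):
    # out_links: dict mapping every vertex to the list of its out-neighbours.
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    active = set(out_links)  # vertices whose rank may still change
    for _ in range(max_iter):
        changed = set()
        for v in active:
            new = (1 - d) / n + d * sum(rank[u] / len(out_links[u]) for u in in_links[v])
            if abs(new - rank[v]) > tol:
                changed.add(v)
            rank[v] = new
        if not changed:
            break
        # Only vertices with an in-neighbour that changed can change in the next iteration.
        active = {w for u in changed for w in out_links[u]}
    return rank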
Search in Research, Let's Make it More Complex!
1. Search in Research, Let’s Make it More Complex!
Collaboratively Looking Under the Hood and Its Consequences
Marijn Koolen
Humanities Cluster - Royal Netherlands Academy of Arts and Sciences
CLARIAH Media Studies Summer School
Netherlands Institute for Sound and Vision, 3 July 2018
2. Overview
1. Search in Research
a. Search as part of research process
b. Search vs. other access methods
2. Search, Retrieval and Ranking
a. Retrieval Systems, Ranking Algorithms and Relevance Models
3. Searching in Digital Collections
a. Understanding (digital) collections and their construction
b. Tool analysis through experimentation
4. Search Strategies and Corpus Building
a. Systematic searching
b. Search strategies and sampling
4. ● Research Phases
○ Exploration, gathering, analysis, synthesis, presentation
○ Extremely non-linear (affordance of digital realm)
● Search happens throughout research process
○ Search phases: pre-focus, focus, post-focus
○ Use different types of collections and search engines
■ General purpose search engines,
■ Domain- and collection-specific (e.g. GLAMS),
■ Personal/private (offline) collections
○ Search strategies:
■ Ad hoc or systematic: berrypicking (Bates 1989), keyword harvesting (Burke 2011), …
■ Important for data and tool criticism
Research Process
5. ● For many online materials access is limited to search interface
○ Browsing is guided by available structure
■ Drill down via facets
■ Navigate via metadata fields (if enabled)
○ Without (relevant) structure, direct search is only practical alternative
● Searching as exploration
○ How does search engine provide overview?
■ How big is collection?
■ How is collection structure communicated?
■ What (meta)data is available?
■ How are search characteristics explained?
■ How are search results summarised?
Search Engine as Mediator
6. ● Browsing brings you along unintended materials:
○ Navigating your way to relevance
○ Impresses on you what else there is (see also Putnam 2016)
● Keyword search tends to focus on relevance
○ Pushes back related/nearby materials
○ Collection structure can be enabled to allow faceting (overview)
● Search and research methodology
○ Impact of digital keyword search needs to be reflected in methodology
○ How do you account for search process in scholarly communication?
■ Method of citation is based on analogue browse/search in archives and libraries
■ Pre-focus to focus: switch between ad hoc and systematic?
■ Non-linearity: exploration never stops, assumptions constantly challenged
Browsing vs. Keyword Searching
7. 'To take a single example of this disconnect between research process and representation, many of us use and cite eighteenth and nineteenth-century newspapers as simple hard-copy references without mention of how we navigated to the specific article, page and issue. In doing so, we actively misrepresent the limitations within which we are working.' (Hitchcock 2013, 12)
'This is not only about being explicit about our use of keyword searching - it is about moving beyond a traditional form of scholarship to data modelling and to what Franco Moretti calls “distant reading”.'
(Hitchcock, Confronting the Digital, 2013, p. 19).
Keyword Search and “Confronting the Digital”
8. Information Search and Seeking
● Search takes place in context
○ Part of seeking, and overall inf. behaviour (Wilson)
○ As inf. behaviour changes (phases), so does seeking and search behaviour
● Reflection-in-action
○ When and where are choice points?
○ How do search actions relate to strategy and inf. need?
10. Search and Accountability
● What should scholars account for?
○ Aspects of sources, tools and process
● Digital source criticism
○ How to evaluate digital sources (Fickers 2012)
○ Who made digital source, when, why, what for, how?
● Digital tool criticism
○ How to evaluate impact of digital tools (Koolen et al. 2018)
○ Reflection-in-action, experimentation
● Data Scopes
○ How to communicate research process to others (Hoekstra & Koolen 2018)
○ Discuss process of selection, modelling, normalization, linking, classification
13. Retrieval - Matching and Similarity
● Matching based on user query
○ Query: free text, controlled facet, example (doc, AV or text)
○ Matching docs returned in certain order (non-matching are not retrieved)
■ How does search engine perform matching (esp. for free text and example)?
■ Potentially many objects match query: does order matter?
● Similarity
○ Degree of matching: some match better than others (notion of similarity)
■ Retrieve most similar documents first (ranking)
○ Similar how? Does interface explain?
● Retrieval and ranking
○ Retrieval: which matching documents are returned to the user as results?
○ Ranking: in which order are the results returned?
14. Retrieval, Ranking and Relevance
● Retrieval results form a set
○ Can be ordered or unordered (e.g. SQL or SPARQL query)
■ Even unordered sets need to be presented to the user in some order
○ Criteria for ordering: alphabetic, size, recency, popularity (views, likes, citations, links)
■ Ordering re-organizes materials, temporarily disrupts “original” organization
■ Provides different view on materials
● Many systems perform relevance ranking
○ Relevant to who or what?
■ Query: document similarity scores
■ User: e.g. search history, preferences
■ Situation: user, location, time, device, query, work context (page views, annotations)
■ Other aspects: quality, diversity, controversy, polarity, exploration/exploitation, ...
15. ● How does an algorithm understand the notion of relevance?
○ Statistical interpretation:
■ Generally: frequent words carry less signal, look for unexpected stuff
■ Many ways of scoring signal
○ TF-IDF:
■ Term Frequency in document (relevance of term in document)
■ Inverse of Document Frequency in collection (commonness of term across docs)
○ Probabilistic Language Model (PLM):
■ Probability of picking term from document as bag of words (relevance of term in doc)
■ Probability of picking term from collection as bag of words (commonness of term)
○ Many other relevance models, e.g. BM25, DFR, SDM, …
■ Different interpretations of relevance, hence different rankings
Algorithmic Interpretation of Relevance
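For concreteness, the two relevance models named on this slide can be written as follows (my own simplified notation, not taken from the slides):
TF-IDF: score(q, d) = \sum_{t \in q} \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}, where tf(t, d) is the frequency of term t in document d, N the number of documents, and df(t) the number of documents containing t.
PLM (query likelihood with linear smoothing): P(q \mid d) = \prod_{t \in q} \left( (1 - \lambda) \frac{\mathrm{tf}(t, d)}{|d|} + \lambda \frac{\mathrm{cf}(t)}{|C|} \right), where |d| is the document length, cf(t) the collection frequency of t, |C| the total number of term occurrences in the collection, and \lambda the smoothing weight.
In both, frequent terms receive a low idf or a high collection probability and so contribute little to the score, which is the statistical reading of "frequent words carry less signal".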
16.
17.
18. Ranking Issues
● Document length
○ TF-IDF doesn’t model document length, favours longer documents
○ PLM explicitly normalizes on document length, favours shorter documents
○ Upshot: Delpher API returns short documents first for short queries
● Document priors: are all documents equal or not?
○ Can use document prior probability (independent of query)
○ Can favour documents that are more popular, recent, authoritative, …
○ Can favour documents that are more appropriate for situation (location, time of day, …)
● Problem: how do you know how search engine scores relevance?
○ How much should you know about it?
○ Many GLAM search engines have relatively straightforward relevance models, no doc priors
○ Google uses many hundreds of features for document, query, user and situation
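A one-line sketch, in my own notation, of how a query-independent document prior enters such a ranking function: P(d \mid q) \propto P(d) \cdot P(q \mid d), i.e. score(d, q) = \log P(d) + \sum_{t \in q} \log P(t \mid d), where the prior P(d) can encode popularity, recency or authority; with a uniform prior, all documents are treated as equal.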
19. Relevance in Metadata Records
● Relevance ranking of metadata records
○ Metadata records are peculiar textual representations
■ Minimal amount of text, low redundancy
■ Majority of terms occur only once
○ Which part of TF-IDF contributes more to score of metadata record?
○ Which fields are useful/used for matching?
● NISV collection
○ Search engine indexes metadata records
■ Some records have lengthy itemized descriptions, some have not
■ Some have transcripts, some have not
○ Consequences for retrieving? And for ranking?
■ How does search engine handle this?
■ How does search engine communicate this?
20. ● Hard to match keywords against AV signal directly
○ Option: use text representation for AV document
■ E.g. metadata, description, script, speech transcript, ...
○ Option: use AV representation of query
■ E.g. example document or user recording
■ Use audio or visual similarity (again, similar how?)
Retrieving and Ranking Audiovisual Materials
21. ● Experiment to understand search functionalities
○ How can you find out if multiple search terms are treated with Boolean AND or OR operators?
○ How can you find out if terms are stemmed/normalized?
● Phrase search:
○ What happens when you use quotation marks to group terms into a phrase?
○ How do the results compare to those using no quotation marks?
● Proximity search:
○ Can you specify that terms should be near each other?
● Fuzzy search: wildcard and edit distance searches
○ Controlling lexical variation vs. uncontrolled wildcard search
○ voetbal+voetballen vs. voetbal* (matches voetbalvereniging, voetbalveld, ...)
Opaqueness of Interfaces and Experimentation
22. ● Experiment with Search and Compare tools of the CLARIAH Mediasuite
○ Find out if stopwords are removed
○ Find out if words are stemmed/normalized
○ Find out how multi-word queries are interpreted, i.e. as AND or OR
○ Find out how standard search operators work
■ Boolean AND, OR and NOT
■ Quotation marks for phrases
Exercise
24. ● Collections of GLAMs are often built up over decades
○ Based on aims and selection criteria
■ Rarely "complete", dependent on availability of materials
○ Digital access via digitization, or digital archiving (born-digital)
■ Some things are lost in this process (e.g. context, quality, …)
● Heterogeneity: mix of object/source types (sub-collections)
○ Different modalities, different ways of accessing and presenting
■ Text vs. Image vs. AV vs. 3D (or 4D)
Nature of Digital Collections
25. Nature of Metadata
● Digital access via metadata
○ Metadata: data about the object/source
○ Types: formal, structural, technical, administrative, aboutness
○ Metadata fields allow selection and search via specific fields
■ Title, description, creator, creation date, genre, …
○ Allows (seemingly) uniform access to heterogeneous collections
■ But, different materials have different aspects to describe
■ Edition is relevant for books and films, not so much for paintings
● Metadata creation process
○ Often done with limited time, information and system flexibility
○ Inherently subjective, especially content analysis
● Size matters
○ Requirements change as size of collection grows (also depends on expectations)
26. ● Hierarchical organization
○ 4 levels
■ Series: De Wereld Draait Door
■ Season: De Wereld Draait Door 2016
■ Program: De Wereld Draait Door 21-06-2016
■ Segment: De Wereld Draait Door 21-06-2016
○ Each level has a metadata record (with overlap in field, e.g. title)
● Follows archival standard
○ Describe aspect at highest relevant level
○ Don’t repeat at lower levels unless it deviates (e.g. main titles)
○ Fonds: aggregation of documents from same origin
Archival Structure and NISV Audiovisual Collection
27. ● Power of the archive
○ Problem of perspective (from archive-as-source to archive-as-subject, Stoler 2002)
● History of the archive
○ Collections created over decades often go through changes in
■ selection criteria, cataloguers (human or algorithm),
■ cataloguing budgets, policies, rules, practice and vocabularies,
■ software (migrations and updates), hardware,
■ institutional mission, societal attitudes, …
○ Most of these aspects remain undocumented or partially documented
● Consequences
○ Almost inherently incomplete, inconsistent and sometimes necessarily incorrect
○ After many years, it's hard to retrace what happened
■ and how it affects access, selection and analysis
Digital Source and Data Criticism
29. Combined Collections
● Several portals combine (heterogeneous) collections
○ Examples:
■ Europeana, European Newspapers, EUscreen, Nederlab, Delpher, Online Archives of California, …
○ Worldwide aggregated collections:
■ ArchiveGrid (1000+ archives): over 5M finding aids
■ WorldCat (72,000 libraries): 400M records, 2.6B assets, 100M persons
● Huge challenge for source criticism as well as search
○ Collections vary in size, provenance, selection criteria, metadata policies, interpretation and richness
○ Heterogeneous metadata schemas have been mapped to a single schema
■ Causes problems for interpretation
■ E.g. what does creator mean for paintings, films, tv series, letters, advertisements, ...?
30. Assessing Metadata Quality
● Questions
○ What are the pitfalls of relying on metadata?
○ How can we evaluate metadata quality?
○ What are relevant aspects to consider?
● Collection inspection
○ In the CLARIAH Media Suite we created a tool for inspecting metadata
■ Esp. useful for complex collections like the NISV audiovisual collection
■ Somewhat ad hoc, please feel encouraged to give feedback!
○ Please go to the Media Suite and open the Collection Inspector tool
■ Click on “select field to analyse” and let the interface load the completeness data (this will take a while)
34. Assessing Timelines and Other Visualizations
● Timeline visualizations give a view of the temporal spread of results
○ But they are very difficult to interpret properly
● Issues with absolute frequencies:
○ Collection materials are not evenly distributed over time
○ Need to compare the query-specific distribution to the collection distribution (see the sketch after this slide)
● Issues with relative frequencies:
○ Incompleteness is not evenly distributed either (use the Collection Inspector)
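A minimal sketch of the normalization step mentioned above, with invented per-year counts: divide the number of query hits per year by the total number of collection items for that year, so that peaks in the collection itself do not masquerade as peaks in the topic.

```python
# Hypothetical per-year counts: hits for one query vs. total items in the collection.
hits_per_year = {2014: 120, 2015: 320, 2016: 310}
collection_per_year = {2014: 10_000, 2015: 40_000, 2016: 20_000}

for year in sorted(hits_per_year):
    absolute = hits_per_year[year]
    relative = absolute / collection_per_year[year]
    print(year, f"absolute={absolute}", f"relative={relative:.2%}")

# 2015 has the most hits in absolute terms, but relative to collection size
# 2016 stands out: absolute and relative frequencies tell different stories.
```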
35. Retrievability and Metadata Characteristics
● Different types of metadata fields
○ Controlled vocabulary: e.g. broadcast channel (radio or tv)
○ Number: number of episodes/seasons/segments
○ Time/date: program length, recording date
○ Free keyword/keyphrase: title, person name (tend to be non-unique)
○ Free text: description, summary, transcript, … (tend to be unique)
● Different types allow different forms of retrieval and ranking
○ Long text fields have more terms, with higher frequencies (see the toy example after this slide)
■ Some types of programs have longer descriptions/transcripts
■ These match more queries, so they have a higher chance of being retrieved
■ The impact of long text fields on ranking depends on the relevance model!
○ Repeated values allow aggregation, navigation
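A toy illustration of that retrievability effect (invented records, simple term matching): the record with a long free-text field matches far more single-term queries than the record with only a short title, regardless of how relevant it actually is.

```python
# Invented records: one with only a short title, one with a long transcript-like field.
records = {
    "short": set("journaal 1965".split()),
    "long": set(("verslag van de wedstrijd tussen ajax en feyenoord "
                 "met commentaar interviews en reacties van supporters").split()),
}
queries = ["ajax", "supporters", "journaal", "interviews", "weer"]

for name, terms in records.items():
    matched = [q for q in queries if q in terms]
    print(name, f"matches {len(matched)} of {len(queries)} queries:", matched)

# The long record is retrieved far more often; whether it is also ranked highly
# depends on the relevance model (e.g. how it normalizes for field length).
```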
36. ● Some search interfaces offer facets to narrow down search results
○ E.g. broadcaster and genre in the CLARIAH Media Suite
○ Facets provide overview, afford focusing through selection
● How do facets work?
○ Based on metadata fields: a rich schema gives rich options for facets
○ Types of metadata fields: controlled vocab, number, date, keyword/phrase, free text
■ Facets work for fields with a limited range of values, so not for free-text fields
○ Long tails in facets: typically a few high-frequency and many low-frequency values (see the sketch after this slide)
Metadata and Search Facets
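A minimal sketch of how a facet can be derived from a metadata field: count the values of a controlled field (here a hypothetical "genre" field) over a result set. Even in this tiny invented example the outcome is long-tailed, with one frequent and several rare values, and records with missing metadata simply drop out of the facet.

```python
from collections import Counter

# Hypothetical result set with a controlled 'genre' metadata field.
results = [
    {"title": "Item 1", "genre": "news"},
    {"title": "Item 2", "genre": "news"},
    {"title": "Item 3", "genre": "news"},
    {"title": "Item 4", "genre": "sports"},
    {"title": "Item 5", "genre": "documentary"},
    {"title": "Item 6", "genre": None},  # missing metadata: contributes no facet value
]

facet = Counter(r["genre"] for r in results if r["genre"])
for value, count in facet.most_common():
    print(value, count)  # news 3, sports 1, documentary 1 -- a miniature long tail
```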
38. Exercise
● Experiment with the Collection Inspector of the CLARIAH Media Suite
○ Try out the collection inspector:
■ Scroll through the list of fields to get an idea of what is available
■ Look at the completeness of fields such as “genre”, “keywords” and “awards” (a toy completeness check is sketched after this slide)
■ Which metadata fields are relatively complete?
■ At which archival levels are they most complete?
● Explore which fields are available and which fields make good facets
○ Explore facet distributions in entire collection and for specific queries
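What the Collection Inspector reports can be approximated by hand: the completeness of a field is simply the fraction of records in which that field has a value. A toy version with invented records:

```python
def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, "", []))
    return filled / len(records)

# Invented records standing in for metadata exported from a collection.
records = [
    {"title": "A", "genre": "news",   "awards": None},
    {"title": "B", "genre": None,     "awards": None},
    {"title": "C", "genre": "sports", "awards": "Gouden Kalf"},
]
for field in ("title", "genre", "awards"):
    print(field, f"{completeness(records, field):.0%}")  # title 100%, genre 67%, awards 33%
```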
40. ● Importance of selection criteria
○ Do you have to hand-pick each document?
○ Or can you select sets based on matching criteria?
○ Is representativeness important? If so, representativeness of what?
○ Or completeness? Why?
● Exploiting facets and dates
○ Filtering: align facets/dates with research focus
○ Sampling: compare across facets
■ Which facet types can you use?
○ Sampling strategies (one variant is sketched after this slide)
■ Sample per facet/year (e.g. X items per facet/year)
■ Within facets, select random or not
Searching for Corpus Building
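One of the sampling strategies above (up to X items per facet value per year), sketched as code. The grouping field, sample size and records are assumptions for illustration; `random.sample` gives the random-within-facets variant, while slicing a sorted list would give a deterministic one.

```python
import random
from collections import defaultdict

def sample_per_facet_year(records, facet_field, per_group=2, seed=42):
    """Take up to `per_group` random items for every (facet value, year) pair."""
    random.seed(seed)
    groups = defaultdict(list)
    for r in records:
        groups[(r[facet_field], r["year"])].append(r)
    sample = []
    for items in groups.values():
        sample.extend(random.sample(items, min(per_group, len(items))))
    return sample

# Invented corpus candidates: 5 news items from 2015, 3 from 2016, 4 sports items from 2015.
records = [{"id": i, "genre": g, "year": y}
           for i, (g, y) in enumerate([("news", 2015)] * 5 + [("news", 2016)] * 3 +
                                      [("sports", 2015)] * 4)]
print(len(sample_per_facet_year(records, "genre")))  # 3 groups x 2 items = 6
```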
41. Tracking Context in Corpus Building
● Why were certain documents selected?
○ How were they selected?
○ What strategy was used?
○ Documenting helps in understanding and remembering these choices
● Do research goals and questions change during collection?
○ Interacting with sources during search updates knowledge structures (Vakkari 2016)
○ Updates tend to be small and incremental, hence barely noticeable
○ Explicit reflection-in-action can bring these to the surface (Koolen et al. 2018)
○ Adding annotations can also provide context
42. Systematic Searching
● Systematic (comprehensive) search has two factors (Yakel 2010):
○ Search strategy (user)
○ Search functionalities (system)
○ Functionalities shape/affect strategy
● Step 1: systematic search for relevant collections online
○ Different collections/sites offer different search functionalities and levels of detail
○ Explicitly address what consequences this has for your strategy and research goals
● Step 2:
○ Explore individual collections using one or more strategies
○ "Researchers need to be flexible and creative to accommodate the vagaries of cataloging
practices." (Yakel 2010, p. 110)
○ Footnote and reference chasing: references often give an "information scent", suggesting
other collections and items to explore.
43. Search Strategies
● Web search strategies defined by Drabenstott (2001)
○ Discussed in archive context by Yakel (2010)
● Five strategies
○ Synonym generation
○ Chaining
○ Name collection
○ Pearl growing
○ Successive segmentation
● Somewhat related to information seeking patterns by Ellis (1989)
○ Starting, chaining, browsing, differentiating, monitoring, extracting
44. ● Synonym generation: 1) search with a relevant term, 2) close-read the results to identify related terms (wordclouds, facets), 3) search via the related terms for synonyms
● Chaining: follow references/citations (explicit or implicit), identify a relevant subset and use the explicit structure to explore a connected/related subset
● Name collection: search with keywords, identify relevant names, search with those names, identify related names and keywords, repeat. Similar to keyword harvesting (Burke 2011).
Drabenstott’s Strategies (1/2)
45. Drabenstott’s Strategies (2/2)
● Pearl growing: start small and focused with specific search terms, then slowly expand with additional terms to broader topics/themes
● Successive segmentation: the opposite of pearl growing; start broad and increasingly zoom in and focus, e.g. make queries increasingly specific by adding (ANDing) keywords, replacing broad terms with lower-frequency terms, or selecting facets
46. Search Strategies and Research Phases
● Research phase
○ Exploration <-> search phase pre-focus
i. Ad hoc, no need yet for systematic search
ii. Mostly pearl growing and/or successive segmentation to determine focus
○ Analysis <-> search phase focus
i. Switch to systematic, determine strategy
ii. Use chaining, name collection, synonym generation (for coverage/representation,
boundaries)
● But reality resists:
○ (Re)search process is very non-linear
○ Boundary between exploration and analysis is not always clear
○ Late discoveries can prompt or force new directions, ...
47. When To Stop
● Researchers often switch from exploration to a “sort of” systematic search
○ But hard to remember and explain what and how you searched
○ Moreover, difficult to determine when to stop
○ Explicit strategy allows for stopping criteria
● Stopping criteria
○ Check whole set/sample, all available facets, ...
○ Diminishing returns: you increasingly encounter things you have already seen, and newly relevant items become rare
○ When stopping, make explicit (at least for yourself) when and why you stopped
● Meta-strategy:
○ change strategy/tactics
○ E.g. successive segmentation -> harvest keywords -> switch segment -> harvest keywords, ...
48. Wrap Up
● Search in research
○ How to incorporate these processes into the research methodology
● Large, heterogeneous collections introduce issues for research
○ Assessing incompleteness of materials
○ Assessing incompleteness, incorrectness and inconsistency of metadata
● Looking under the hood
○ Evaluating information access functionalities (search and browse)
○ Selecting an appropriate search strategy for research goals
○ Determining success/failure of searches
○ Understanding search for corpus building
49. Burke, T. 2011. How I Talk About Searching, Discovery and Research in Courses. May 9, 2011.
Drabenstott, K.M. 2001. Web Search Strategy Development. Online, 25(4), pp. 18-25.
Fickers, A. 2012. Towards a New Digital Historicism? Doing History in the Age of Abundance. VIEW Journal of European Television History and Culture, 1(1). http://orbilu.uni.lu/bitstream/10993/7615/1/4-4-1-PB.pdf
Hitchcock, T. 2013. Confronting the Digital: Or How Academic History Writing Lost the Plot. Cultural and Social History, 10(1), pp. 9-23. https://doi.org/10.2752/147800413X13515292098070
Hoekstra, R. and M. Koolen. 2018. Data Scopes for Digital History Research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 51(2).
References
50. References
Koolen, M., J. van Gorp and J. van Ossenbruggen. 2018. Lessons Learned from a Digital Tool Criticism Workshop. Digital Humanities in the Benelux 2018 Conference.
Putnam, L. 2016. The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast. American Historical Review, 121(2), pp. 377-402.
Vakkari, P. 2016. Searching as Learning: A Systematization Based on Literature. Journal of Information Science, 42(1), pp. 7-18.
Yakel, E. 2010. Searching and Seeking in the Deep Web: Primary Sources on the Internet. In: Working in the Archives: Practical Research Methods for Rhetoric and Composition, pp. 102-118.