Here are a few ways the HTRC could help with this use case:
1. Enrich the metadata to consistently tag or link works that were originally serialized, and identify the relevant periodicals. This would allow searching and filtering by serialization.
2. Perform text mining on periodical full text to automatically identify works by author and group related snippets together. Machine learning could help with this task.
3. Allow searching bibliographic metadata by author name, then viewing a timeline or network graph of where their works appeared - both in volumes and periodicals. This could help scholars trace serializations.
4. Support analytical workflows that extract and group text snippets by serialized work, rather than just by volume. This would facilitate
Workset Creation for Scholarly Analysis Project presentation at CNI 2013
1. HathiTrust Research Center:
Improving Scholarly Inquiry
Timothy W. Cole (t-cole3@illinois.edu)
Harriett Green (green19@illinois.edu)
With slides and other contributions from Stephen Downie, Beth
Plale, Colleen Fallaw, Megan Senseney, Katrina Fenlon, et al.
CNI Fall 2013 Membership Meeting
Washington, D.C.
9 December 2013
2. Outline
The HathiTrust Digital Library (HT)
The HathiTrust Research Center (HTRC)
The Workset Creation for Scholarly Analysis (WCSA) Project
User needs & requirements
Characterization of bibliographic metadata for corpus
More about the WCSA RFP & prototyping projects
Timothy W. Cole (t-cole3@uiuc.edu)
University of Illinois at UC
2
CNI 2013 Fall Membership Meeting
9 December 2013
3. The HathiTrust Digital Library (hathitrust.org)
A digital preservation repository coupled with
a highly functional access platform
An international partnership of 80+ research libraries & consortia
Provides long-term preservation of and access to volumes of
member library collections that have been digitized by
Google, the Internet Archive, Microsoft & member institutions
Currently supports ingest of digitized book and journal content,
and similar book-like materials
4. HT DL by the numbers (as of Nov 2013)
10,973,063 total volumes
6,067,835 distinct bibliographic items:
5,778,450 book (monographic) titles
289,385 serial titles
3,803,630,600 pages
487 terabytes
3,512,404 volumes (~32% of total) digitized
from public domain originals
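The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the mean pages per volume is a derived value over the whole corpus, not a figure from the slides, and it is not the same as the public-domain-only distribution discussed on slide 7.

```python
# Figures copied from the slide (HT DL, as of Nov 2013).
total_volumes = 10_973_063
public_domain_volumes = 3_512_404
total_pages = 3_803_630_600
book_titles = 5_778_450
serial_titles = 289_385

# Share of volumes digitized from public domain originals (slide says ~32%).
public_domain_share = public_domain_volumes / total_volumes

# Derived value, not stated on the slide: mean pages per volume, whole corpus.
mean_pages_per_volume = total_pages / total_volumes

# Book and serial titles should add up to the distinct bibliographic items.
distinct_items = book_titles + serial_titles

print(f"Public domain share: {public_domain_share:.1%}")
print(f"Mean pages per volume: {mean_pages_per_volume:.1f}")
print(f"Distinct bibliographic items: {distinct_items:,}")
```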
5. More than just US Libraries
Pie chart: resource language shares
English 49%
German 10%
French 7%
Spanish 4%
Russian 4%
Chinese 4%
Japanese 3%
Italian 3%
Arabic, Latin, Portuguese, Dutch, Polish, Indonesian, Hindi, Hebrew: 1% to 2% each
Other 9%
97% of bibliographic records specify resource language
Only 7% specify more than 1
6. HT DL Searching & Data Availability
Web User Interface (http://www.hathitrust.org/home):
Full text keyword (includes indexed metadata)
Bibliographic metadata keyword
Advanced (field-specific) bibliographic metadata searching
Bibliographic metadata (http://www.hathitrust.org/data):
OAI-PMH & custom bib API – http://www.hathitrust.org/bib_api
HathiFiles (tab delimited metadata) –
http://www.hathitrust.org/hathifiles
Full-text (http://www.hathitrust.org/datasets):
~300,000 digitized volumes in the public domain – contact for bulk
download or use API for volume-by-volume access
~ 3,500,000 volumes digitized by Google from public domain – available
by arrangement, typically using rsync. Must agree to conditions of use.
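As a rough illustration of working with tab-delimited metadata such as the HathiFiles, the sketch below filters rows by language and access value. The column names, their order, the access values, and the sample rows are all hypothetical; the real HathiFiles layout is documented by HathiTrust and is not reproduced here.

```python
import csv
import io

# Hypothetical column layout, for illustration only; the real HathiFiles
# field order is documented by HathiTrust and is not reproduced here.
COLUMNS = ["volume_id", "access", "title", "language"]

# Tiny made-up sample standing in for a downloaded tab-delimited file.
SAMPLE = (
    "vol.0001\tallow\tExample Title One\teng\n"
    "vol.0002\tdeny\tExample Title Two\tger\n"
    "vol.0003\tallow\tExample Title Three\tger\n"
)


def read_rows(handle):
    """Yield one dict per tab-delimited line, keyed by COLUMNS."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield dict(zip(COLUMNS, fields))


def select_volume_ids(handle, language, access="allow"):
    """Return ids of rows matching the given language and access value."""
    return [
        row["volume_id"]
        for row in read_rows(handle)
        if row["language"] == language and row["access"] == access
    ]


german_open = select_volume_ids(io.StringIO(SAMPLE), "ger")
print(german_open)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle and `COLUMNS` by the documented field list.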
7. How many pages per volume?
For volumes digitized
from public domain
sources
8. The HathiTrust Research Center (1)
HTRC is a collaboration between HT, Indiana University
and the University of Illinois at Urbana-Champaign
Goal is to provide computational access to researchers:
initially to all content digitized from public domain
eventually to the entire HT DL corpus
Currently hosts
complete copy of HT metadata
copy of OCR of all HT volumes digitized from public domain
copy of OCR of all public domain volumes in HT
Supported by the Sloan Foundation, the Mellon Foundation, IU, & UIUC
9. The HathiTrust Research Center (2)
HTRC end-user access (so far)
HTRC Portal
https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/
Must login; pull-down login menu (upper right) to sign up (free)
HTRC Workset Builder
https://htrc2.pti.indiana.edu/blacklight
Must login to this interface also; same credentials used.
HTRC Sandbox (contact us)
http://sandbox.htrc.illinois.edu:8080
Clone of Portal, but accessing 250,000 digital public domain volumes
Supports use of the Data API – http://wiki.htrc.illinois.edu/display/COM
As well as the HTRC Solr Proxy API – http://chinkapin.pti.indiana.edu:9994
More: http://www.hathitrust.org/htrc/faq, http://wiki.htrc.illinois.edu/display/OUT
10. HTRC Portal (as it is now)
11. HTRC Workset Builder (as it is now)
12. Create a small workset / collection
13. Submit a small workset / collection for analysis
14. Your completed and pending analytical jobs
16. Workset Creation for Scholarly Analysis
Premise
The ability to slice through a massive
corpus constructed from many different
library collections, and out of that to
construct the precise workset required
for a particular scholarly investigation, is
an example of the “game changing”
potential of the HathiTrust...
17. Motivation & Models
Collections, corpora, worksets, ...:
Scholars & librarians aggregate
items in a variety of contexts:
Archival
Curatorial
Experimental
Referential
Thematic
These worksets facilitate, sometimes
enable certain kinds of scholarly inquiry
Carl Spitzweg. 1850
The Bookworm
(Der Bücherwurm)
Analogy: HathiTrust worksets for analysis are
as the contents of a scholar’s carrel in a library
With apologies to Martin Mueller, et al. 2010. Towards a digital carrel: A report about corpus query tools
18. Anecdotal feedback from UnCamp 2013
My workset should contain…
Volumes pertaining to Japan / in Japanese
All volumes relevant to the study of Francis Bacon
Music scores or notation extracted from HT volumes
Images of Victorian England extracted from HT vols.
Volumes in HT similar to TCP-ECCO novels
19th c. English-language novels by female authors
Representative sample (by pub date & genre) of
French language items in HT
19. What is a Workset (in context of HTRC)?
A workset is an aggregation brought together for the purpose
of analysis, i.e., to facilitate inquiry.
Worksets are conceptual and need to be expressible in a
variety of ways
A workset encapsulates the specific materials that share
specific attributes or satisfy some set of criteria.
May be large, e.g., tens of thousands of items.
Can be constructed by machine as well as human agents.
Attributes and criteria not always bibliographic
Items aggregated may be more granular than a volume
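One way to picture this definition is as a small data structure: a named aggregation of items, built either by hand or by applying a criterion to a candidate pool. The sketch below is an illustration only, not the HTRC data model; the class names, attribute keys, and sample items are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Item:
    """One aggregated unit; page is None when the item is a whole volume."""
    volume_id: str
    page: Optional[int] = None
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Workset:
    """A named aggregation of items brought together for analysis."""
    name: str
    items: List[Item] = field(default_factory=list)

    @classmethod
    def from_criteria(
        cls,
        name: str,
        candidates: Iterable[Item],
        criterion: Callable[[Item], bool],
    ) -> "Workset":
        """Build a workset by machine: keep candidates that satisfy criterion."""
        return cls(name, [item for item in candidates if criterion(item)])


# Made-up candidate pool; attribute keys and values are assumptions.
corpus = [
    Item("vol.A", attributes={"language": "eng", "genre": "novel", "century": "19"}),
    Item("vol.B", attributes={"language": "fre", "genre": "novel", "century": "19"}),
    Item("vol.C", page=12, attributes={"language": "eng", "genre": "score", "century": "19"}),
]


def english_19c_novel(item: Item) -> bool:
    """Bibliographic criterion: 19th c. English-language novels."""
    a = item.attributes
    return (
        a.get("language") == "eng"
        and a.get("genre") == "novel"
        and a.get("century") == "19"
    )


novels = Workset.from_criteria("19th c. English-language novels", corpus, english_19c_novel)

# Page-level items show that a workset can be more granular than a volume.
scores = Workset.from_criteria(
    "Music score pages", corpus, lambda item: item.attributes.get("genre") == "score"
)

print([item.volume_id for item in novels.items])
print([(item.volume_id, item.page) for item in scores.items])
```

Because the criterion is an arbitrary function, it does not have to be bibliographic; it could equally test a feature computed from the full text.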
20. Why Worksets?
The result of a first-level, rough filter
Better scale for intensive analytics
Provides essential scope for certain analytics
Some tools (are trained to) work best on a narrow,
homogeneous workset
Eliminate noise that would otherwise arise by asking
questions across whole of HT
22. Workset Creation for Scholarly Analysis
Prototyping Project
Collection analysis, data modeling and prototype
tools & services to facilitate workset creation
Principal investigators:
J. Stephen Downie, Tim Cole, Beth Plale
Funded by a grant from the Andrew W. Mellon Foundation
1 July 2013 - 30 June 2015
Will feature 4 $40K sub-awards for prototyping/demonstration
projects illustrating how worksets from HT DL can be created
and used and can be useful for scholarly analysis
Methods & tools for metadata enrichment, including with links
Analytical services over full text useful for defining worksets
23. Key research questions for WCSA project
Can we formalize the notion of collections
and worksets in the HTRC context?
What are the attributes that define
and describe a workset in the context of HTRC?
How can we balance rigor with extensibility & flexibility?
What roles do data, metadata, annotations, tags,
feature sets, and so on, play in the conception,
creation, use and reuse of collections and worksets?
Can we demonstrate the utility & practicality of worksets for HTRC?
24. WCSA Timeline
• July 2013: Project start
• Q1: User needs assessments / focus groups
• Q2: HT corpus characterization; Request for Prototype Proposals (RFP)
• Q3: RFP Finalist Workshop (Chicago); prototype experiment funding awarded
• Q4-6: Prototype experiments done; metadata workflow & workset modeling
• Q7-8: Planning for prototype to production; report out
• June 2015: Project ends
25. USER NEEDS & REQUIREMENTS
Harriett Green, English and Digital Humanities Librarian
Preliminary results
An early deliverable of WCSA Project
26. Who Are Our Researchers?
Humanities scholars? Computer programmers
and technologists? Digital humanities
research teams?
Previous research in scholarly use of digital
resources (Duff and Cherry 2000; Brockman
et al. 2001; Warwick et al., 2008; Sukovic,
2008 and 2011; RIN 2011)
Identify use cases for HTRC and large-scale,
digitized text corpora
27. GOOGLE DIGITAL HUMANITIES AWARDS
RECIPIENT INTERVIEWS REPORT
Report prepared for the HTRC in 2011 by UIUC
researchers at GSLIS’s Center for Informatics
Research in Science and Scholarship (CIRSS)
Interviewed researchers who were awarded
Google Digital Humanities Research Awards on
research needs
Findings for scholarly requirements included
improved metadata, accurate OCR, data curation
Report available to download at
http://www.hathitrust.org/htrc
28. Feedback from UnCamp 2013
My workset should contain…
Volumes pertaining to Japan / in Japanese
Music scores or notation extracted from HT volumes
Volumes in HT similar to TCP-ECCO novels
19th c. English-language novels by female authors
General Needs:
User-friendly interfaces
Documentation on the portal
Avenues for community input in HTRC portal development
29. Scholarly Requirements
We are interested in understanding how scholars and researchers
that use digital book and serials collections decide which texts
(or parts of texts) to include in collections used for analysis. This
includes:
How researchers identify, select and obtain access to texts to
include in their analysis
Understanding the specific fields/disciplines that work with these
sources along with the types of research questions and analysis
applied.
Desired units of analysis (works, manifestations, pages, n-grams,
OCR, images, etc.)
Transformation and preprocessing steps
Understanding sources and criteria used for identifying texts
Specific methods of selection
Methods of analysis
Challenges to working with these digital collections (e.g., OCR
quality, duplication)
30. Focus Groups and Interviews
Conducted at the DH 2013, JCDL 2013, and HTRC UnCamp
conferences in summer and fall 2013
Goal: To understand practices of humanities researchers
using digital collections, especially in the context of
large-scale text corpora
Survey instrument queried users about their experiential
practices of organizing datasets
31. Participant Demographics
Positions:
Junior and senior faculty at liberal arts colleges and universities
Computer programmers
Librarians
Data scientists
Academic technologists
Graduate students
Domains:
English literature, classics, linguistics, library and information
science, history
Institutions:
Academic institutions in Great Britain, Singapore, Germany,
France, and the United States
32. Study Design
1. General types of data, materials, or collections
2. Purposes of collections
3. Selection or inclusion/exclusion criteria
4. Sources, acquisition, and access
5. Pre-processing and analysis
6. Post-analysis
7. Challenges
33. Analysis
Methodology: Qualitative content analysis of user
responses
• A “directed” approach based on inductive reasoning to
condense raw data (transcriptions of audio recordings of
interviews and focus groups) into categories and themes
Goal: To identify common themes and patterns in users’
responses
34. Coding (still ongoing)
Coding manual consisting of category names, rules for
assigning codes, and examples:
Challenges — access rights
Challenges — OCR quality
Collections — comprehensiveness
Objects — data
Sources — Google Books
Sources — Selection Criteria — Language
etc.
35. Selected examples for categories
Category: Challenges — Access Rights
User: “I check to see if a volume has
substantial copyrighted text included in it
already as quotes or extracts”
Category: Objects — Temporal
User: “Classic materials”
User: “single-authored books of poetry
between 1840 and 1900”
Etc.
36. Early Findings
• Roles of collections
• Need to implement granular, actionable units of analysis
• Importance of expert-enriched, shareable metadata
37. Figure 1. Selected focus group and interview excerpts on collection- and
workset-building.
38. Figure 2. Selected focus group and interview excerpts on divisibility
and objects of analysis.
39. Figure 3. Selected focus group and interview excerpts on metadata
enrichment and sharing.
40. Use Case 1: Gender
Scholar wants to compare works by gender, based on
the Library of Congress headings
This information is in the metadata, but hard to text
mine
Questions:
How can I track gender of authored texts across time?
What correlations are there between gender of the author
and sentiment analysis of the text?
How are people and characters of different genders
treated in books over time?
41. Use Case 2: Serials
A scholar wants to find a series of an author’s works that
were originally serialized across several issues or volumes
of a periodical.
Serials vs. volumes as manifestations of works
Map the pages for content
Might be able to investigate questions such as:
What was the original instantiation of the work in
serialized form?
How can I text mine for sentiment and themes across the
serialized texts?
42. Use Case 3: Images
Scholar wants to find texts of Victorian travel narratives
and the images depicted in them.
Investigate questions such as:
What are patterns/themes of images depicted in
Victorian England travel narratives?
What is the frequency of images in travel narratives?
43. Use Case 4: Dialogue in Texts
Scholar wants to identify conversational dialogue between
characters in novels.
• Requires OCR that detects boundaries: can we detect
quote marks and signal words for dialogue?
• Create a training set of curated texts (i.e., TCP texts)
matched with HTRC texts, apply detection algorithm
• Enable questions such as:
How are characters connected across the narrative—who
interacts most frequently?
What would sentiment analysis or topic modeling reveal
about the dialogue in comparative novels of the genre?
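The idea of detecting quote marks and signal words could be sketched roughly as follows. This is only an illustrative rule-based sketch, not the detection algorithm the project used: the signal-word list, the `find_dialogue` helper, and the 40-character search window are all assumptions made for the example, and real OCR text would need far more robust handling (single quotes, unbalanced marks, line breaks).

```python
import re

# Hypothetical list of signal words (speech verbs); not taken from the slides.
SIGNAL_WORDS = ("said", "asked", "replied", "cried", "answered", "exclaimed")

# Match text between straight or curly double quotation marks.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')
SIGNAL_RE = re.compile(r"\b(" + "|".join(SIGNAL_WORDS) + r")\b", re.IGNORECASE)


def find_dialogue(text, window=40):
    """Return (quote, signal_word_or_None) for each quoted span in text."""
    results = []
    for match in QUOTE_RE.finditer(text):
        # Look for a signal word just after the quotation, then just before it.
        before = text[max(0, match.start() - window):match.start()]
        after = text[match.end():match.end() + window]
        signal = SIGNAL_RE.search(after) or SIGNAL_RE.search(before)
        results.append((match.group(1), signal.group(1).lower() if signal else None))
    return results


sample = '"Where are you going?" asked Anne. He turned away. "Home," he replied.'
dialogue = find_dialogue(sample)
```

A rule-based pass like this might serve as a baseline; the training-set approach mentioned above would presumably replace the hand-written signal list with patterns learned from curated texts.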
44. User Needs for Worksets
Comments from interviews/focus groups:
“How do I gather works similar to those I currently have in
hand? Can I define different kinds of similarity?”
“How do I merge a HathiTrust collection of works and
metadata with my set of works and tags and my
colleague’s annotations?”
45. How useful is existing metadata for creating worksets?
HT metadata is bibliographic
Built from MARC records provided by members & OCLC
Good, consistent quality for author / title / pub info
Subject less extensive, less consistent
Genre more hit and miss
Author gender not present in MARC bib records;
present in some MARC authority records
MARC records provided for serials are about the serial,
not about the contents of the serial
No visibility over internal elements (e.g., images, embedded
language / genre, dialog, ...) of digitized volumes
Timothy W. Cole (t-cole3@uiuc.edu)
University of Illinois at UC
CNI 2013 Fall Membership Meeting
9 December 2013
46. Total Records
Total records: 6.1 million
Having at least 1 genre: 5.2 million (85%)
Having no genre: 0.9 million (15%)
47. Breakdown of other category, incl. fiction
48. 2.6 million (43%)
of bib records
include LC Class no.
49. Not all genres equally described
Volumes identified as fiction: 270837
at least one subject: 70706 (26%)
no subject: 200131 (74%)
subjectGeographic: 25491 (9%)
subjectTemporal: 12536 (5%)
subjectTopic: 61788 (23%)
subjectName: 13412 (5%)
50. Top genres by country of publication
bib records: 6067835
country specified: 5361473
and genre specified: 4776973
51. Top subjects by country of publication
bib records: 6067835
country specified: 5361473
and LC class no. specified: 2354094
52. Fiction as proportion of publications by decade
53. Opportunities
Computed attributes
Author age at time of publication
FRBR relationships
Add attributes not included in bibliographic records
Author gender
Author nationality
Improve completeness & accuracy of bib records
Describe internal components of volumes
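As one illustration of a computed attribute, author age at time of publication might be derived from the dates that often accompany a personal name in a bibliographic record. The sketch below assumes a date string shaped like "1812-1870" and an integer publication year; the function name and the plausibility bounds are hypothetical, and real records would need handling for approximate dates, missing birth years, and posthumous publication.

```python
import re


def author_age_at_publication(author_dates, publication_year):
    """Estimate author age from a date string like '1812-1870' and a publication year."""
    # Take the first four-digit run as the birth year (an assumption about the format).
    match = re.match(r"\s*(\d{4})", author_dates or "")
    if match is None or publication_year is None:
        return None
    age = publication_year - int(match.group(1))
    # Discard implausible values, e.g. reprints issued long after the author's lifetime.
    return age if 0 <= age <= 120 else None


age = author_age_at_publication("1812-1870", 1850)
```

Returning `None` rather than guessing keeps incomplete records visible, which matters when the attribute is later used to filter a workset.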
54. More about RFP
4 awards to teams of scholars, librarians & developers
$40,000 each
Period of performance 16 April 2014 – 15 Jan 2015
UIUC will supply a testbed of ~250,000 representative volumes;
additional volumes (digitized from public domain) available
UIUC will collaborate, provide access to HTRC cluster, ...
Deliverables: final report; open source software
Schedule:
Letters of Intent Due (preferred): 16 December 2013
Final Proposals Due: 13 January 2014
Shortlist Meeting Invitations Issued: 20 January 2014
Shortlist Meeting: 20 February 2014
Award Notification: No later than 15 March 2014
55. Questions?
Timothy Cole
Mathematics and Digital Services Librarian,
UIUC
t-cole3@illinois.edu
Harriett Green
English and Digital Humanities Librarian, UIUC
green19@illinois.edu
Twitter: @greenharr
56. Discussion Questions
Key questions to look for in the data
Alternative approaches and methodologies
Knowing what we know about user needs to date, what
are the implications for formalizing the notion of a workset?
How does this translate across domains? (e.g., workset-like objects in science and elsewhere...)
What are the reusability and reproducibility implications
for such highly individualized and complex digital objects?
Editor's Notes
HathiTrust is a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. There are more than 80 partners in HathiTrust, and membership is open to institutions worldwide.
Raison d'être
Subfields: statistical text analysis; 19th c. American and British history, women’s history, digital humanities, digital collections, HCI, visualization, literary studies, digital philology, digital classics
Ages: Ranged from 26 to 62, but the average was about 48 and the median was 50
Besides US, nationalities included French, German, Indian, British, and Italian
Experience in the area ranged from 2 years to 30