Presented at the 10th annual Data Harmony Users Group meeting on Tuesday, February 11, 2014 by Sharon Garewal of JSTOR. JSTOR is an archive of over 8 million articles, book chapters, and primary source content. Last year at DHUG, a case study on the then incomplete JSTOR Thesaurus was presented. This presentation will give an update on how the completed thesaurus has been constructed and how branches are currently being reviewed and revised. Training materials and workflow processes, which have been documented for maintenance and editing of the thesaurus, will be shared. Discussions on finding and working with subject matter experts and the triumphs and challenges that have been encountered along the way will be highlighted.
2. Agenda
Introduction
Where we left off
Where we are now
Maintenance and Editing Process
◦ Training
◦ Documents
◦ Workflows
Where we are going
3. Introduction
The Numbers:
1,061 Participating publishers
1,936 Academic journals
8,952,463 Articles and counting
21,737 Books and counting
25,976 19th Century British Pamphlets
Publishing dates: 1545 CE to 2014 CE
138 Million searches performed in 2013
5. Where we left off: Pilot project
Goal: To better understand the use and deployment of
thesauri & test selected thesauri on our disciplines
◦ Vendor: Access Innovations (AI) Data Harmony Software:
MAIstro for auto-indexing and creating a rulebase.
◦ Selection: 15k articles from 3 disciplines.
3 thesauri selected: NICEM for History, CABI for Science
and AMB for Business
Auto-indexed and assigned terms.
Added to rulebase to improve indexing.
◦ Lessons learned:
Selecting the (right) thesauri is important
Rule building increases accuracy from 70% to 88%
Maintenance needs to be ongoing
6. Building of the thesaurus
Access Innovations (AI)
◦ Selected, collected and imported
17+ source vocabularies
◦ Merged the lists which included
sorting terms hierarchically and
removing duplicates.
◦ Built and tested rules
◦ Used search logs, discipline lists
and access to our content
Construction standards
◦ ANSI (American National Standards
Institute)
◦ NISO (National Information
Standards Organization)
ANSI/NISO Z39.19-2005
◦ ISO (International Organization
for Standardization)
ISO 2788, ISO 5964
◦ BSI (British Standards Institution)
BS 8723 parts 1-4
AI Business thesaurus
AI Calculus thesaurus
AI Economics thesaurus
AI Geology thesaurus
AI Law thesaurus
AI Psychology thesaurus
ASIS&T
CABI
ERIC
Ethnographic thesaurus
EuroVoc
Getty Arts and Architecture thesaurus
Glossary of Statistics
MeSH (abridged)
NASA Thesaurus
NAL
NICEM
Philosopher’s Index Thesaurus
Statistics Canada
National Transportation Library
7. Where we are now
Thesaurus was officially delivered: June 2013
Continued editorial relationship with AI
Added two more JSTOR Librarians to
thesaurus team
Thesaurus Statistics
◦ Preferred Terms: 56,913
◦ Equivalent Terms: 41,608
◦ Top Terms or Branches: 18
◦ Terms with [at least one] Related Term: 18,965
Rulebase Statistics
◦ Rules: 100,737
10. Maintenance and Editing
Process: Training
Training PowerPoint
◦ Pre-training day activities: reading
through standards, attending meetings…
◦ 3-4 day hands-on training in MAIstro
Weekly tasks
◦ Adding, deleting and moving terms;
Changing the capitalization of terms;
Searching the rulebase; Interpreting
complex rules; Adding complex rules
11. Reviewing a branch
Orient yourself in the branch
Review terms in the branch
Take notes
Research the branch
Organize the branch
◦ Best practices
◦ Decision tree
12. Keep the following in mind when
reviewing terms:
◦ Appropriacy: Is the term appropriate to the target audience?
◦ Belonging: Does the concept fit within the coverage of the thesaurus structure?
◦ Consistency: Is the term stylistically consistent with the other terms in the thesaurus
structure?
◦ Currency: Does the term reflect the most current common usage for the concept?
◦ Distinctiveness: Does the term clearly represent a distinction that is important to the
audience?
◦ Implication: Does the term imply additional concepts or terms?
◦ Novelty: Does the term refer to a concept that is not already in the thesaurus?
◦ Standardization: Is the term part of an authorized standard vocabulary for which there is a
compliance requirement?
◦ Structure: Does a proposed new concept/term, along with others, warrant a new branch in
the thesaurus?
◦ Technical Accuracy: Does the term accurately reflect the intended meaning to the intended
audience?
◦ Warrant: Can you find explicit warrant (support) for your concept/term in: The JSTOR
corpus, its usage, standard vocabularies for which there are compliance requirements?
(i.e., user warrant & literary warrant)
From: Weise, C. Criteria for Term Selection in Your Taxonomy, Feb. 1, 2013.
13. Costume design is the fabrication of clothing for the overall appearance of a character or
performer. Costume is specific in the style of dress particular to a nation, a class, or a period…
16.
Does the term already exist in the thesaurus?
If no, search JSTOR. How many search results are there?
◦ Fewer than 100 hits: do not add.
◦ More than 100 hits: investigate.
◦ What is the term about? What journals and topics are associated with the term?
◦ Does the term appear in article titles, citations, abstracts, or the body of the text? This is subjective.
◦ Add term.
If yes, look at where it lives and see if you can add any NPTs or RTs.
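The decision tree can be sketched in code. This is only an illustration: `review_candidate`, its parameters, and the returned strings are hypothetical, the hit count is assumed to come from a manual JSTOR search, and the slide does not say how exactly 100 hits should be treated (here it falls on the "investigate" side).

```python
from __future__ import annotations

# Threshold from the slide. Treating exactly 100 hits as "investigate"
# is an assumption; the slide only covers fewer and more than 100.
MIN_HITS = 100


def review_candidate(term: str, thesaurus: set[str], jstor_hits: int) -> str:
    """Return the next action for a candidate term.

    `thesaurus` is a hypothetical set of existing terms, and `jstor_hits`
    is the result count the editor saw when searching JSTOR by hand.
    """
    existing = {t.lower() for t in thesaurus}
    if term.lower() in existing:
        return "exists: review where it lives and consider adding NPTs or RTs"
    if jstor_hits < MIN_HITS:
        return "do not add"
    # The remaining steps are editorial judgment and are not automated here.
    return "investigate: check subject, journals, and where the term appears"
```

The final "add term" decision stays with the editor, since the slide notes that this step is subjective.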
18. Complex rules
Proximity
◦ NEAR: Within 3 words of text-to-match.
Used for phrasings of a term, prepositional
phrases etc.
◦ WITH: Within the same sentence of text-to-
match. This is the most common/default.
◦ AROUND: Within 50 words of text-to-
match, which is approximately one
paragraph.
◦ MENTIONS: Within 250 words of text-to-
match, which is approximately one page.
Helps cut down on noise by establishing the
broadest area possible. Not used as
frequently.
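As a rough illustration of the four proximity ranges, the sketch below checks whether a second word falls within range of the text to match. It is not MAIstro's implementation: the function `within`, the single-word inputs, the tokenizer, and the naive sentence split are all assumptions made for the example.

```python
from __future__ import annotations

import re

# Word windows from the slide; WITH is sentence-based and handled separately.
WINDOWS = {"NEAR": 3, "AROUND": 50, "MENTIONS": 250}


def _words(text: str) -> list[str]:
    """Lowercase word tokens (a simplification of real tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(operator: str, text: str, match: str, other: str) -> bool:
    """True if the single word `other` is within range of the word `match`."""
    match, other = match.lower(), other.lower()
    if operator == "WITH":
        # Naive sentence split on '.', '!' or '?'.
        return any(
            match in _words(sentence) and other in _words(sentence)
            for sentence in re.split(r"[.!?]+", text)
        )
    words = _words(text)
    limit = WINDOWS[operator]
    match_pos = [i for i, w in enumerate(words) if w == match]
    other_pos = [i for i, w in enumerate(words) if w == other]
    return any(abs(i - j) <= limit for i in match_pos for j in other_pos)
```

For example, in "The temperature of the water rose.", `within("NEAR", text, "temperature", "water")` is true because the two words are three positions apart.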
19. Complex rules continued
ALL CAPS
◦ Text to match: sat IF (ALL CAPS) USE Standardized tests
INITIAL CAPS
◦ Text to match: bush IF (INITIAL CAPS) USE U.S. Presidents
MATCH
◦ Text to match: IF (MATCH “musicianship”) USE Musicianship
BEGINS SENTENCE or ENDS
SENTENCE
◦ Text to match: chronicle IF (BEGINS SENTENCE) USE History
◦ Text to match: lol IF (ENDS SENTENCE) USE Humor
Remember to use Booleans!
AND, OR, NOT
ELSE & ELSE IF
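The capitalization conditions can be illustrated with a small sketch that reuses the two slide examples. It does not reproduce MAIstro rule syntax or its engine; `RULES`, `CONDITIONS`, and `suggest` are hypothetical names, and the tokenizer is a simplification.

```python
from __future__ import annotations

import re

# (text to match, condition, term to use), copied from the slide examples.
RULES = [
    ("sat", "ALL CAPS", "Standardized tests"),
    ("bush", "INITIAL CAPS", "U.S. Presidents"),
]

# Assumed interpretation of each condition, applied to the word as written.
CONDITIONS = {
    "ALL CAPS": lambda token: token.isupper(),
    "INITIAL CAPS": lambda token: token[0].isupper() and token[1:].islower(),
}


def suggest(text: str) -> list[str]:
    """Return the terms whose rule fires somewhere in `text`."""
    terms: list[str] = []
    for token in re.findall(r"[A-Za-z]+", text):
        for match, condition, term in RULES:
            fires = token.lower() == match and CONDITIONS[condition](token)
            if fires and term not in terms:
                terms.append(term)
    return terms
```

With these rules, "SAT" triggers Standardized tests but "sat" does not, and "Bush" triggers U.S. Presidents but "bush" does not.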
22. Testing articles
Choose an article from JSTOR to
copy/paste into Test MAI tab.
Evaluate the list of MAI suggested
terms.
MAI Suggested Terms:
Temperature|(54) temperature(54)
Circadian rhythm|(9) temperature compensation(9)
Parametric models|(6) model*(6)
Biochemistry|(5) biochemical(4) biochemistry(1)
Temperature dependence|(5) dependen*(3) temperature dependence(2)
The term on the left side of the | is
the MAI suggested term; the term on
the right is the word that triggered it.
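A suggestion line of this shape can be split into its parts with a short parser. The line format is inferred only from the five sample lines on the slide, so treat the patterns below as an assumption rather than a documented MAIstro output format; `parse_suggestion` is a hypothetical helper.

```python
from __future__ import annotations

import re

# "<suggested term>|(<total>) <trigger>(<count>) <trigger>(<count>) ..."
LINE = re.compile(r"^(?P<term>[^|]+)\|\((?P<total>\d+)\)\s*(?P<triggers>.*)$")
TRIGGER = re.compile(r"\s*(.+?)\((\d+)\)")


def parse_suggestion(line: str) -> tuple[str, int, dict[str, int]]:
    """Return (suggested term, total count, {trigger text: count})."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    triggers = {text: int(n) for text, n in TRIGGER.findall(m["triggers"])}
    return m["term"], int(m["total"]), triggers
```

Parsing the "Biochemistry" sample line gives a total of 5 and the triggers "biochemical" (4) and "biochemistry" (1), which shows at a glance which words drove a suggestion.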
Hits – System accurately and correctly suggests indexing terms chosen by the editor. No additional rule building is necessary.
Misses – System misses terms the editor uses. Reviewing articles following a gap analysis is necessary to identify misses. Rule building and possibly additional term building are necessary.
Noise – System suggests terms not used by the editor, or suggests a term that is used by the editor but whose meaning is not accurately represented. Rule building is necessary.
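The first-pass comparison between editor terms and MAI suggestions amounts to set arithmetic, sketched below. `compare_indexing` is a hypothetical helper, and it only catches noise in the sense of terms the editor did not choose; a term both sides use but with the wrong meaning still needs human review.

```python
from __future__ import annotations


def compare_indexing(editor_terms: set[str], mai_terms: set[str]) -> dict[str, set[str]]:
    """Classify terms for one article as hits, misses, or noise."""
    return {
        "hits": editor_terms & mai_terms,    # suggested and chosen by the editor
        "misses": editor_terms - mai_terms,  # chosen by the editor, not suggested
        "noise": mai_terms - editor_terms,   # suggested, not chosen by the editor
    }
```

For instance, if the editor chose Temperature and Circadian rhythm while MAI suggested Temperature and Parametric models, the result is one hit, one miss, and one noise term.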
25. Maintenance and Editing
Process: Documents
How-To Guides
◦ How to configure Unicode settings in web
browsers
◦ How to correct capitalization in the term
record
◦ How to export a sub-branch
◦ How to install MAIstro on your computer
◦ How to remove Related Terms which appear
in the same branch
Term Building Instructions
Rule Building Instructions
26. Instructions for term and rule building include key
terms, definitions and best practices.
27. Parking Lots
“Parking Lots” are a way to keep track
of terms that we want to look into and
rules we need to build.
28. Maintenance and Editing
Process: Workflows
Data analysis
◦ General accuracy: Sampling of 1000 articles
across content types and disciplines.
◦ Subject specific: Sampling on specific
disciplines and/or journals.
New content assessment
◦ Weekly review of newly signed content
Search log review
◦ Done semiannually: a report of terms
searched in JSTOR. Review the ranking of
terminology and how term usage changes
over time, and find new acronyms.
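One way to picture the search log review is to rank query terms in two periods and compare them. The sketch below assumes each log is simply a list of query strings; the function names and that input shape are hypothetical and do not describe JSTOR's actual reports.

```python
from __future__ import annotations

from collections import Counter


def rank(queries: list[str]) -> dict[str, int]:
    """Map each normalized query to its rank (1 = most searched)."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return {term: i for i, (term, _) in enumerate(counts.most_common(), start=1)}


def compare_periods(earlier: list[str], later: list[str]) -> dict[str, object]:
    """List queries new in `later` and rank changes for shared queries."""
    before, after = rank(earlier), rank(later)
    return {
        "new": sorted(set(after) - set(before)),
        # Positive values mean the query moved up the ranking.
        "rank_change": {t: before[t] - after[t] for t in after if t in before},
    }
```

Queries that appear only in the later period are candidates for new terms or acronyms to investigate.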
30. Where we want to go
Implementation onto the platform in 2014
◦ Currently working with teams in JSTOR such
as UX and Analytics to run experiments and
gather metrics.
Name file
Staffing and resources
◦ Continue to train additional Librarians and
create additional workflows.
SMEs
◦ Set up a system to work with SMEs