Case Study: JSTOR: A Year Later



Presented at the 10th annual Data Harmony Users Group meeting on Tuesday, February 11, 2014 by Sharon Garewal of JSTOR. JSTOR is an archive of over 8 million articles, book chapters, and primary source content. Last year at DHUG, a case study on the then-incomplete JSTOR Thesaurus was presented. This presentation gives an update on how the completed thesaurus was constructed and how branches are currently being reviewed and revised. Training materials and workflow processes, which have been documented for maintenance and editing of the thesaurus, will be shared. The presentation also covers finding and working with subject matter experts, along with the triumphs and challenges encountered along the way.



  1. JSTOR Case Study: A Year Later. Sharon Garewal, Metadata Librarian
  2. Agenda
     • Introduction
     • Where we left off
     • Where we are now
     • Maintenance and Editing Process
       ◦ Training
       ◦ Documents
       ◦ Workflows
     • Where we are going
  3. Introduction. The numbers:
     • 1,061 participating publishers
     • 1,936 academic journals
     • 8,952,463 articles and counting
     • 21,737 books and counting
     • 25,976 19th Century British Pamphlets
     • Publishing dates: 1545 CE to 2014 CE
     • 138 million searches performed in 2013
  4. JSTOR Subject Areas
     • Social Sciences
     • Humanities
     • History
     • Science & Mathematics
     • Business & Economics
     • Arts
     • Law
     • Area Studies
     • Medicine & Allied Health
  5. Where we left off: Pilot project
     Goal: To better understand the use and deployment of thesauri and to test selected thesauri on our disciplines
     ◦ Vendor: Access Innovations (AI). Data Harmony software: MAIstro for auto-indexing and creating a rulebase.
     ◦ Selection: 15k articles from 3 disciplines.
       • 3 thesauri selected: NICEM for History, CABI for Science and AMB for Business
       • Auto-indexed and assigned terms.
       • Added to rulebase to improve indexing.
     ◦ Lessons learned:
       • Selecting the (right) thesauri is important
       • Rule building increases accuracy from 70% to 88%
       • Maintenance needs to be ongoing
  6. Building of the thesaurus
     • Access Innovations (AI)
       ◦ Selected, collected and imported sources from 17+ source vocabularies
       ◦ Merged the lists, which included sorting terms hierarchically and removing duplicates.
       ◦ Built and tested rules
       ◦ Used search logs, discipline lists and access to our content
     • Construction standards
       ◦ ANSI (American National Standards Institute)
       ◦ NISO (National Information Standards Organization)
         • ANSI/NISO Z39.19-2005
       ◦ ISO (International Organization for Standardization)
         • ISO 2788, ISO 5964
       ◦ BS (British Standards Institution)
         • BS 8723 parts 1-4
     Source vocabularies:
     • AI Business thesaurus
     • AI Calculus thesaurus
     • AI Economics thesaurus
     • AI Geology thesaurus
     • AI Law thesaurus
     • AI Psychology thesaurus
     • ASIS&T
     • CABI
     • ERIC
     • Ethnographic thesaurus
     • EuroVoc
     • Getty Arts and Architecture thesaurus
     • Glossary of Statistics
     • MeSH (abridged)
     • NASA Thesaurus
     • NAL
     • NICEM
     • Philosopher’s Index Thesaurus
     • Statistics Canada
     • National Transportation Library
  7. Where we are now
     • Thesaurus was officially delivered: June 2013
     • Continued editorial relationship with AI
     • Added two more JSTOR Librarians to thesaurus team
     • Thesaurus Statistics
       ◦ Preferred Terms: 56,913
       ◦ Equivalent Terms: 41,608
       ◦ Top Terms or Branches: 18
       ◦ Terms with [at least one] Related Term: 18,965
     • Rulebase Statistics
       ◦ Rules: 100,737
  8. JTHES
  9. SharePoint Site
  10. Maintenance and Editing Process: Training
      • Training PowerPoint
        ◦ Pre-training day activities: reading through standards, attending meetings…
        ◦ 3-4 day hands-on training in MAIstro
      • Weekly tasks
        ◦ Adding, deleting and moving terms
        ◦ Changing the capitalization of terms
        ◦ Searching the rulebase
        ◦ Interpreting complex rules
        ◦ Adding complex rules
  11. Reviewing a branch
      • Orient yourself in the branch
      • Review terms in the branch
      • Take notes
      • Research the branch
      • Organize the branch
        ◦ Best practices
        ◦ Decision tree
  12. Keep the following in mind when reviewing terms:
      ◦ Appropriacy: Is the term appropriate to the target audience?
      ◦ Belonging: Does the concept fit within the coverage of the thesaurus structure?
      ◦ Consistency: Is the term stylistically consistent with the other terms in the thesaurus structure?
      ◦ Currency: Does the term reflect the most current common usage for the concept?
      ◦ Distinctiveness: Does the term clearly represent a distinction that is important to the audience?
      ◦ Implication: Does the term imply additional concepts or terms?
      ◦ Novelty: Does the term refer to a concept that is not already in the thesaurus?
      ◦ Standardization: Is the term part of an authorized standard vocabulary for which there is a compliance requirement?
      ◦ Structure: Does a proposed new concept/term, along with others, warrant a new branch in the thesaurus?
      ◦ Technical Accuracy: Does the term accurately reflect the intended meaning to the intended audience?
      ◦ Warrant: Can you find explicit warrant (support) for your concept/term in the JSTOR corpus, its usage, or standard vocabularies for which there are compliance requirements? (i.e., user warrant & literary warrant)
      From: Weise, C. Criteria for Term Selection in Your Taxonomy, Feb. 1, 2013.
  13. Costume design is the fabrication of clothing for the overall appearance of a character or performer. Costume is specific in the style of dress particular to a nation, a class, or a period…
  14. Review of Costume design branch using Word
  15. Review of Sociology branch using Google Docs
  16. Does the term already exist in the thesaurus?
      • If no, search JSTOR. How many search results are there?
        ◦ Fewer than 100 hits: do not add.
        ◦ More than 100 hits: investigate. What is the term about? What journals and topics are associated with the term? Does the term appear in article titles? Citations? Abstracts? Body of the text? This is subjective. Add term.
      • If yes, look at where it lives and see if you can add any NPTs or RTs.
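
The decision tree on slide 16 can be sketched in a few lines of Python. This is only an illustration, not JSTOR's tooling: the `Candidate` class, its field names, and the returned strings are hypothetical, the subjective investigation step is reduced to a single boolean set by the editor, and treating exactly 100 hits as "investigate" is an assumption because the slide does not cover that case.

```python
from dataclasses import dataclass

MIN_HITS = 100  # threshold from the slide; handling of exactly 100 is assumed


@dataclass
class Candidate:
    """A proposed term (hypothetical structure for illustration)."""
    term: str
    in_thesaurus: bool
    jstor_hits: int = 0            # number of JSTOR search results for the term
    editor_approves: bool = False  # outcome of the subjective investigation


def review_candidate(c: Candidate) -> str:
    """Walk the slide's decision tree and return the recommended action."""
    if c.in_thesaurus:
        # Term exists: look at where it lives and enrich it instead.
        return "review existing term: consider adding NPTs or RTs"
    if c.jstor_hits < MIN_HITS:
        return "do not add"
    # Enough hits: the editor investigates journals, titles, abstracts, etc.
    if c.editor_approves:
        return "add term"
    return "do not add after investigation"
```

For example, `review_candidate(Candidate("Costume design", in_thesaurus=False, jstor_hits=42))` returns `"do not add"` because the term falls below the hit threshold.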
  17. Simple Rules
  18. Complex rules
      • Proximity
        ◦ NEAR: Within 3 words of text-to-match. Used for phrasings of a term, prepositional phrases etc.
        ◦ WITH: Within the same sentence of text-to-match. This is the most common/default.
        ◦ AROUND: Within 50 words of text-to-match, which is approximately one paragraph.
        ◦ MENTIONS: Within 250 words of text-to-match, which is approximately one page. Helps cut down on noise by establishing the broadest area possible. Not used as frequently.
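
To make the four proximity windows on slide 18 concrete, here is a minimal Python sketch. It is not MAIstro's rule engine or syntax: the function names, the tokenizer, the naive sentence splitter, and the reading of "within N words" as a token-distance of at most N are all assumptions, and only single-word operands are handled.

```python
import re

# Window sizes in words, taken from the slide. WITH is sentence-based instead.
WINDOWS = {"NEAR": 3, "AROUND": 50, "MENTIONS": 250}


def tokenize(text):
    """Lowercase word tokens (simplified tokenizer, an assumption)."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def within_words(text, a, b, window):
    """True if words a and b occur at most `window` tokens apart."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def same_sentence(text, a, b):
    """True if a and b share a sentence (naive split on ., ! and ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return any(a in tokenize(s) and b in tokenize(s) for s in sentences)


def proximity(op, text, a, b):
    """Evaluate one of NEAR, WITH, AROUND or MENTIONS for two words."""
    if op == "WITH":
        return same_sentence(text, a, b)
    return within_words(text, a, b, WINDOWS[op])
```

With the text `"The costume was designed for the stage. Later the theater burned."`, `proximity("NEAR", text, "costume", "theater")` is false because the words are eight tokens apart, while the same pair satisfies `AROUND`.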
  19. Complex rules continued
      • ALL CAPS
        ◦ Text to match: sat IF (ALL CAPS) USE Standardized tests
      • INITIAL CAPS
        ◦ Text to match: bush IF (INITIAL CAPS) USE U.S. Presidents
      • MATCH
        ◦ Text to match: IF (MATCH “musicianship”) USE Musicianship
      • BEGINS SENTENCE or ENDS SENTENCE
        ◦ Text to match: chronicle IF (BEGINS SENTENCE) USE History
        ◦ Text to match: lol IF (ENDS SENTENCE) USE Humor
      Remember to use Booleans! AND, OR, NOT. ELSE & ELSE IF.
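
The capitalization and sentence-position conditions on slide 19 can be imitated with plain regular expressions. The sketch below is an assumption-laden illustration rather than MAIstro behavior: the predicate functions, the `RULES` table, and the sentence-boundary test (punctuation immediately before or after the match) are hypothetical, and the MATCH condition and Boolean operators are left out.

```python
import re


def find_matches(text, text_to_match):
    """Case-insensitive whole-word matches of text_to_match in text."""
    pattern = re.compile(r"\b" + re.escape(text_to_match) + r"\b", re.IGNORECASE)
    return list(pattern.finditer(text))


def all_caps(m):
    return m.group(0).isupper()


def initial_caps(m):
    word = m.group(0)
    return word[0].isupper() and word[1:].islower()


def begins_sentence(m):
    before = m.string[:m.start()].rstrip()
    return before == "" or before[-1] in ".!?"


def ends_sentence(m):
    after = m.string[m.end():].lstrip()
    return after == "" or after[0] in ".!?"


# (text to match, condition, term to use), following the slide's examples.
RULES = [
    ("sat", all_caps, "Standardized tests"),
    ("bush", initial_caps, "U.S. Presidents"),
    ("chronicle", begins_sentence, "History"),
    ("lol", ends_sentence, "Humor"),
]


def suggest_terms(text):
    """Return the terms whose condition holds for at least one match."""
    suggested = []
    for text_to_match, condition, term in RULES:
        if any(condition(m) for m in find_matches(text, text_to_match)):
            suggested.append(term)
    return suggested
```

The sentence "She sat for the SAT in May." yields only `Standardized tests`, since the lowercase "sat" fails the ALL CAPS condition. A sentence that merely starts with "Bush" would still trigger `U.S. Presidents`, which is the kind of noise the slide's Boolean operators are meant to narrow.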
  22. Testing articles
      • Choose an article from JSTOR to copy/paste into Test MAI tab.
      • Evaluate the list of MAI suggested terms.
      • MAI Suggested Terms:
          Temperature|(54) temperature(54)
          Circadian rhythm|(9) temperature compensation(9)
          Parametric models|(6) model*(6)
          Biochemistry|(5) biochemical(4) biochemistry(1)
          Temperature dependence|(5) dependen*(3) temperature dependence(2)
        The term on the left side of the | is the MAI suggested term; the term on the right is the word that triggered it.
      • Hits: System accurately and correctly suggests indexing terms chosen by the editor. No additional rulebuilding is necessary.
      • Misses: System misses terms the editor uses. Reviewing articles following a gap analysis is necessary to identify misses. Rulebuilding and possibly additional term building are necessary.
      • Noise: System suggests terms not used by the editor, or suggests a term that is used by the editor but whose meaning is not accurately represented. Rulebuilding is necessary.
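
The suggested-term lines on slide 22 follow a regular pattern (`Term|(total) trigger(count) ...`), so they can be parsed and compared against an editor's choices. The following Python sketch assumes that pattern holds for every line. The function names are hypothetical, and the set comparison only catches the first kind of noise (terms the editor did not use), not terms applied with the wrong meaning.

```python
import re

# "Term|(total) trigger(count) trigger(count) ..."
LINE = re.compile(r"^(?P<term>[^|]+)\|\((?P<total>\d+)\)\s*(?P<triggers>.*)$")
TRIGGER = re.compile(r"(.+?)\((\d+)\)\s*")


def parse_suggestion(line):
    """Split one suggestion line into (term, total, {trigger: count})."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized suggestion line: {line!r}")
    triggers = {text.strip(): int(n)
                for text, n in TRIGGER.findall(m.group("triggers"))}
    return m.group("term").strip(), int(m.group("total")), triggers


def evaluate(suggested_terms, editor_terms):
    """Classify terms as hits, misses, or noise by simple set comparison."""
    suggested, editor = set(suggested_terms), set(editor_terms)
    return {
        "hits": sorted(suggested & editor),
        "misses": sorted(editor - suggested),
        "noise": sorted(suggested - editor),
    }
```

Parsing `"Biochemistry|(5) biochemical(4) biochemistry(1)"` gives the term `Biochemistry`, a total of 5, and the trigger counts `{"biochemical": 4, "biochemistry": 1}`.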
  25. Maintenance and Editing Process: Documents
      • How-To Guides
        ◦ How to configure Unicode setting in web browsers
        ◦ How to correct capitalization in the term record
        ◦ How to export a sub-branch
        ◦ How to install MAIstro on your computer
        ◦ How to remove Related Terms which appear in the same branch
      • Term Building Instructions
      • Rule Building Instructions
  26. Instructions for Term and Rule building include key terms, definitions and best practices.
  27. Parking Lots
      • “Parking Lots” are a way to keep track of terms that we want to look into and rules we need to build.
  28. Maintenance and Editing Process: Workflows
      • Data analysis
        ◦ General accuracy: Sampling of 1000 articles across content types and disciplines.
        ◦ Subject specific: Sampling on specific disciplines and/or journals.
      • New content assessment
        ◦ Weekly review of newly signed content
      • Search log review
        ◦ Done semiannually. Report of searched terms in JSTOR. Review ranking of terminology and how term usage changes over time. Finding new acronyms.
  29. Where we want to go
      • Implementation onto the platform in 2014
        ◦ Currently working with teams in JSTOR such as UX and Analytics to run experiments and gather metrics.
      • Name file
      • Staffing and resources
        ◦ Continue to train additional Librarians and create additional workflows.
      • SMEs
        ◦ Set up a system to work with SMEs
  30. Thank You. Contact information: