
The JTHES as Part of the Intelligence Layer for the Sustainability Collection Prototype

Presented at the 2015 Data Harmony User Group Meeting in Albuquerque, New Mexico on February 17, 2015 by Ron Snyder and Sharon Garewal of ITHAKA Labs.

The JSTOR Sustainability Collection, which will launch in 2015, is composed of journals, reports, and working papers selected in consultation with scholars, policy researchers, and subject librarians. The collection features journal titles from academic publishers, scholarly societies, and industry groups, as well as a substantial library of indexed reports and working papers from leading research institutes and university centers. It addresses the emerging interdisciplinary discussion about how the environment, human activities, and economic gains can be made durable over the long term. Along with this broad set of content, the collection will feature specialized functionality to support research in this emerging field, including a semantic indexing feature that helps researchers locate related terms and concepts that may have varying names across disciplines.

Ron Snyder, ITHAKA Labs Director of Research and Development, and Sharon Garewal, Senior Metadata Librarian, will discuss how the JSTOR Thesaurus (JTHES) was applied as part of the intelligence layer for the Sustainability collection prototype. This includes adding a facet for sustainability within the JTHES to tag terms as part of the collection, working with subject matter experts (SMEs) across disciplines, and applying the curated terms to a live data portal.



  1. JSTOR Sustainability Collection
     • Sharon Garewal, JSTOR Senior Metadata Librarian
     • Ron Snyder, ITHAKA Labs Director of Research and Development
  2. Overview
     • Sustainability collection defined
     • Utilization of the thesaurus within the sustainability collection
     • Subject matter experts enlisted
     • Results
     • Live demo
  3. JSTOR: a quick primer
     • 3,200+ journals & 30,000+ books
     • 9.3 million full-length articles
     • 70 million pages
     • 2.9 million book reviews
     • 138 million content accesses in 2013
     • 100 million searches per year
  4. Sustainability Collection: what will it be?
     • Driver: an emerging interdisciplinary area whose research and teaching needs JSTOR wanted to support.
     • Core topics: Cities and Urbanization, Food and Agriculture, Industrial Ecology, Resource Economics, Forestry and Land Use, and Environmental Policy and Law
     • Composed of journals, books, and grey literature (working reports, research reports, technical reports, etc.)
     • Specialized functionality to support research, including semantic indexing to help researchers locate related terms and concepts. This is where the JSTOR Thesaurus (JTHES) comes into play!
  5. JTHES: 19 top terms; 57,470 terms; 103,129 rules
  6. The challenge
     • To assemble a list of key terms in sustainability
       • The terms will be used to organize and tag sustainability-related research articles on JSTOR starting in 2015.
       • These terms will also be used for an autocomplete function in the search component.
     • Utilize the JTHES in a live prototype
       • This was the first project where we looked at how to use the thesaurus as an intelligence layer within a collection. How should it work? How do we do this?
  7. How do we get this done? The options…
     • Create a new thesaurus for sustainability:
       • Pros: Specific to sustainability
       • Cons: Changes must be remembered and made in more than one place; cost of creating and maintaining a separate thesaurus
     • Create a sustainability branch within JTHES:
       • Pros: Could use BT (broader term) relationships to pull all relevant branches and terms from elsewhere in the JTHES into one branch
       • Cons: Redundant; multiple BTs clutter up the JTHES
     • Create a facet to tag terms within JTHES as “Sustainability”:
       • Pros: Creates a flat list (in faceted view) of all of the terms in that facet; easy to maintain
       • Cons: Does not show a hierarchy; cannot have multiple facets
  8. The road to sustainability…
     • Research: examined existing glossaries and thesauri created by research libraries, discipline associations, and individual scholars in each of the disciplines.
     • Existing terms (pulling lists)
     • Existing branches (clean up)
     • Adding new terms
     • Adding new branches: Food studies, Urban studies, etc.
     • Constructing new rules and refining existing rules
     • Testing content
  9. Enlisting subject matter experts
     • Contacted faculty members in ten disciplines to go over the subset of terms assembled in their discipline and review those terms with an eye toward:
       • Is this how people in the field express this concept?
       • Is it correctly included in the sustainability facet?
       • Are there any important terms or concepts that we've missed? (including acronyms, synonyms, variant spellings, inverted phrases)
  10. SME spreadsheets
      • Each SME approached their subject area slightly differently: some were reluctant to give much feedback, while others gave large amounts of feedback to sift through.
      • Example of terms pulled from Law, Public administration/policy, and International/global studies
  11. View: the facet provides an alphabetical list of all tagged terms
  12. (no text on this slide)
  13. Implementation of the Sustainability Prototype
      • The thesaurus and semantic index are used for content discovery and presentation
      • The identification of a “sustainability collection” from the JSTOR corpus was performed using topic modeling (specifically LDA, Latent Dirichlet Allocation)
        • A model of 100 topics was generated from the content
        • Staff assigned sustainability scores for each of the topics based on a review of the top words in each topic
        • Each document in the JSTOR corpus was then assigned a sustainability score of 0-9 based on the sustainability scores for the topics most closely associated with the document
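
The document-scoring step on this slide can be sketched in a few lines. This is a minimal illustration, not the prototype's actual code: the slide does not say how many topics count as "most closely associated" or how their scores are combined, so the `top_k` cutoff, the weighted average, and the function name `document_sustainability_score` are all assumptions.

```python
from typing import Dict


def document_sustainability_score(
    doc_topics: Dict[int, float],
    topic_scores: Dict[int, int],
    top_k: int = 3,
) -> int:
    """Return a 0-9 score for one document.

    doc_topics maps topic id -> the document's weight on that topic
    (for example, from an LDA model). topic_scores maps topic id -> the
    staff-assigned sustainability score (0-9) for that topic.

    Assumption: the score is the weight-averaged topic score over the
    document's top_k topics, rounded and clamped to 0-9.
    """
    # Keep only the topics most closely associated with the document.
    top = sorted(doc_topics.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total_weight = sum(weight for _, weight in top)
    if total_weight == 0:
        return 0
    weighted = sum(topic_scores.get(t, 0) * w for t, w in top) / total_weight
    return max(0, min(9, round(weighted)))


if __name__ == "__main__":
    # Hypothetical values: topic 0 was judged highly sustainability-related.
    staff_scores = {0: 9, 1: 3, 2: 0}
    print(document_sustainability_score({0: 0.6, 1: 0.3, 2: 0.1}, staff_scores))
```

A document dominated by highly scored topics lands near 9, while one dominated by unrelated topics lands near 0; a threshold on this score could then define membership in the collection.
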
  14. Weighting of document-level indexed terms
      • Document-level weights were computed for each semantic term using TF-IDF
        • TF-IDF is a measure of how important a word is to a document in a collection
        • The TF-IDF value increases proportionally to the number of times the word appears in a document (the ‘TF’ or term frequency), but is offset by how common the word is in a corpus (the ‘IDF’ or inverse document frequency)
      • The TF-IDF weighted terms are used to:
        • order the terms displayed for each document
        • boost document relevancy when index terms are used in discovery
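
As a rough sketch of the weighting described on this slide, the snippet below computes TF-IDF over the thesaurus terms indexed on each document and orders a document's terms by weight. The slides do not state which TF-IDF variant was used; this assumes one common form (relative term frequency multiplied by the natural log of N/df), and the names `tf_idf_weights` and `ranked_terms` are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Compute a TF-IDF weight for every indexed term of every document.

    docs maps a document id to the list of thesaurus terms indexed on it,
    with repeats (one entry per occurrence).
    Assumed variant: tf = count / terms in document, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    weights: Dict[str, Dict[str, float]] = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def ranked_terms(doc_weights: Dict[str, float]) -> List[str]:
    """Order a document's terms for display: highest weight first, ties by name."""
    return sorted(doc_weights, key=lambda term: (-doc_weights[term], term))


if __name__ == "__main__":
    # Hypothetical indexed terms for three documents.
    corpus = {
        "a": ["urban planning", "urban planning", "forestry"],
        "b": ["forestry", "food security"],
        "c": ["forestry"],
    }
    print(ranked_terms(tf_idf_weights(corpus)["a"]))
```

In this toy corpus "forestry" appears in every document, so its weight is zero and it sorts last, which matches the slide's point that terms common across the corpus are discounted.
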
  15. Auto-suggest and refining results [Thesaurus slide: a new thing, metadata we create, screenshot(s) of Sustainability Portal]
  16. Refinements in our use of the thesaurus and semantic index in sustainability
      • Auto calculation of sustainability score using LDA topics and thesaurus sustainability facet
        • Calculate topics and term correlations
        • Compute sustainability score for each topic based on the most relevant terms and sustainability facet
        • Compute a sustainability score for each corpus document based on topic weights and topic sustainability score
      • Automated LDA topic labeling
        • Labeling topics generated by unsupervised topic modeling is an ongoing challenge
        • We’re investigating the feasibility of using the same topic/term correlations used to compute sustainability scores to assign labels
        • The approach attempts to find the thesaurus term that best characterizes the most highly correlated terms for each topic
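
One way the automated scoring on this slide might work is sketched below. The slide names the inputs (topic/term correlations, the sustainability facet, and document topic weights) but not the formulas, so everything here is an assumption: a topic's score is taken as the share of its top-term correlation mass that falls inside the facet, scaled to 0-9, and a document's score is the topic-weight average of those topic scores. The function names are hypothetical.

```python
from typing import Dict, Set


def topic_sustainability_score(
    term_correlations: Dict[str, float],
    sustainability_facet: Set[str],
    top_n: int = 10,
) -> float:
    """Score a topic 0-9 from its most relevant thesaurus terms.

    term_correlations maps thesaurus term -> correlation with the topic.
    Assumption: score = 9 * (correlation mass of top_n terms that carry
    the sustainability facet) / (correlation mass of all top_n terms).
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(weight for _, weight in top)
    if total <= 0:
        return 0.0
    in_facet = sum(weight for term, weight in top if term in sustainability_facet)
    return 9.0 * in_facet / total


def document_score_from_topics(
    doc_topic_weights: Dict[int, float],
    topic_scores: Dict[int, float],
) -> float:
    """Average the topic scores, weighted by the document's topic weights."""
    total = sum(doc_topic_weights.values())
    if total <= 0:
        return 0.0
    return sum(w * topic_scores.get(t, 0.0) for t, w in doc_topic_weights.items()) / total


if __name__ == "__main__":
    # Hypothetical facet and topic/term correlations.
    facet = {"renewable energy", "land use"}
    topics = {
        0: {"renewable energy": 0.5, "land use": 0.25, "poetry": 0.25},
        1: {"poetry": 0.5, "sonnets": 0.5},
    }
    scores = {t: topic_sustainability_score(c, facet) for t, c in topics.items()}
    print(scores, document_score_from_topics({0: 0.5, 1: 0.5}, scores))
```

Replacing the staff review of top words with a facet lookup is what makes the score automatic: once terms are tagged in the thesaurus, no per-topic manual judgment is needed when the topic model is regenerated.
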
  17. Other JSTOR Labs projects/tools using the thesaurus and semantic index
      • Thesaurus Visualization Tool
  18. And some other JSTOR Labs projects
  19. Thank you!