JSTOR Sustainability Collection - DHUG 2015

Presentation at the 2015 Digital Harmony Users Group workshop describing the use of the JSTOR thesaurus in the prototyping of new discovery tools for a future sustainability collection.

Published in: Technology

  1. JSTOR Sustainability Collection
     Sharon Garewal, JSTOR Senior Metadata Librarian
     Ron Snyder, ITHAKA Labs Director of Research and Development
  2. Overview
     • Sustainability collection defined
     • Utilization of the thesaurus within the sustainability collection
     • Subject matter experts enlisted
     • Results
     • Live demo
  3. JSTOR: a quick primer
     • 3,200+ journals & 30,000+ books
     • 9.3 million full-length articles
     • 70 million pages
     • 2.9 million book reviews
     • 138 million content accesses in 2013
     • 100 million searches per year
     http://www.jstor.org/
  4. Sustainability Collection: what will it be?
     • Driver: an emerging interdisciplinary area that JSTOR wanted to support for both research and teaching needs.
     • Core topics: Cities and Urbanization, Food and Agriculture, Industrial Ecology, Resource Economics, Forestry and Land Use, and Environmental Policy and Law
     • Composed of journals, books, and grey literature (working reports, research reports, technical reports, etc.)
     • Specialized functionality to support research, including semantic indexing to help researchers locate related terms and concepts. This is where the JSTOR Thesaurus (JTHES) comes into play!
  5. JTHES
     19 top terms; 57,470 terms; 103,129 rules
  6. The challenge
     • To assemble a list of key terms in Sustainability
     • The terms will be used to organize and tag sustainability-related research articles on JSTOR starting in 2015.
     • These terms will also be used for an autocomplete function in the search component.
     • Utilize the JTHES in a live prototype
     • This was the first project where we looked at how to use the thesaurus as an intelligence layer within a collection. How should it work? How do we do this?
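A minimal sketch of how a list of key terms could back an autocomplete function, assuming simple case-insensitive prefix matching over a sorted list. The slides do not say how the prototype implements autocomplete; the function names and sample terms here are hypothetical.

```python
from bisect import bisect_left


def build_index(terms):
    """Return a sorted, de-duplicated, lower-cased list of terms."""
    return sorted({t.lower() for t in terms})


def suggest(index, prefix, limit=5):
    """Return up to `limit` terms from the sorted index that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search for the first term that is >= the prefix.
    start = bisect_left(index, prefix)
    matches = []
    for term in index[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Hypothetical sample terms, for illustration only.
index = build_index([
    "Sustainable agriculture",
    "Sustainable development",
    "Soil erosion",
    "Urban planning",
])
print(suggest(index, "sus"))
```

Keeping the index sorted lets each lookup start with a binary search rather than a scan of every term.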
  7. How do we get this done? The options…
     • Create a new thesaurus for sustainability:
       • Pros: Specific to sustainability
       • Cons: Remembering to make changes in more than one place. Cost associated with creating and maintaining a separate thesaurus
     • Create a sustainability branch within JTHES:
       • Pros: Could BT (Broader terms) all relevant branches and terms from elsewhere in the JTHES into one branch
       • Cons: Redundant; Multiple BTs clutter up the JTHES
     • Create a facet to tag terms within JTHES as “Sustainability”:
       • Pros: Creates a flat list (in faceted view) of all of the terms in that facet; Easy to maintain
       • Cons: Does not show a hierarchy; Cannot have multiple facets
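The facet option can be illustrated with a small sketch: each term keeps its broader-term (BT) links untouched and carries a facet tag, and the faceted view is a flat alphabetical list of the tagged terms. The `Term` class, its fields, and the sample terms are assumptions for illustration, not the JTHES data model.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical thesaurus term: a label, its broader terms, and facet tags."""
    label: str
    broader: list = field(default_factory=list)
    facets: set = field(default_factory=set)


def facet_view(terms, facet):
    """Flat, case-insensitive alphabetical list of labels tagged with `facet`."""
    return sorted((t.label for t in terms if facet in t.facets), key=str.lower)


# Illustrative terms only; the hierarchy (broader) is unaffected by tagging.
terms = [
    Term("Urban studies", broader=["Social sciences"], facets={"Sustainability"}),
    Term("recycling", broader=["Waste management"], facets={"Sustainability"}),
    Term("Sonnets", broader=["Poetry"]),
    Term("Food studies", broader=["Social sciences"], facets={"Sustainability"}),
]
print(facet_view(terms, "Sustainability"))
```

The sketch shows the trade-off named on the slide: the view is easy to produce and maintain, but it carries no hierarchy of its own.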
  8. The road to sustainability…
     • Research: examined existing glossaries and thesauri created by research libraries, discipline associations and individual scholars in each of the disciplines.
     • Existing terms (pulling lists)
     • Existing branches (clean up)
     • Adding new terms
     • Adding new branches: Food studies, Urban studies, etc.
     • Constructing new rules and refining existing rules
     • Testing content
  9. Enlisting subject matter experts
     • Contacted faculty members in ten disciplines to go over the subset of terms assembled in their discipline and review those terms with an eye toward:
       • Is this how people in the field express this concept?
       • Is it correctly included in the sustainability facet?
       • Are there any important terms or concepts that we've missed? (including acronyms, synonyms, variant spellings, inverted phrases)
  10. SME spreadsheets
      Each SME approached their subject area slightly differently: some were reluctant to give much feedback, while others gave large amounts of feedback to sift through.
      Example of terms pulled from Law, Public administration/policy and International/global studies
  11. View: Facet
      Provides an alphabetical list of all tagged terms
  12. labs.jstor.org/sustainability
  13. Implementation of the Sustainability Prototype
      • The thesaurus and semantic index are used for content discovery and presentation
      • The identification of a “sustainability collection” from the JSTOR corpus was performed using topic modeling (specifically LDA, Latent Dirichlet Allocation)
      • A model of 100 topics was generated from the content
      • Staff assigned sustainability scores for each of the topics based on a review of the top words in each topic
      • Each document in the JSTOR corpus was then assigned a sustainability score of 0-9 based on the sustainability scores for the topics most closely associated with the document
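The document-scoring step above can be sketched as follows. The slides do not give the exact formula, so this assumes a weight-averaged score over the document's top few topics, rounded and clamped to 0-9; the function name, the `top_k` parameter, and the sample values are hypothetical.

```python
def document_sustainability_score(topic_weights, topic_scores, top_k=3):
    """Score a document 0-9 from its most closely associated topics.

    topic_weights: dict of topic id -> weight of that topic in the document
                   (for example, the per-document output of an LDA model).
    topic_scores:  dict of topic id -> staff-assigned sustainability score (0-9).
    top_k:         how many of the document's strongest topics to use
                   (an assumption; the slides do not state a number).
    """
    top = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in top)
    if total == 0:
        return 0
    # Average the topic scores, weighted by how strongly each topic is present.
    weighted = sum(weight * topic_scores.get(topic, 0) for topic, weight in top) / total
    return max(0, min(9, round(weighted)))


# Illustrative values only.
weights = {0: 0.6, 1: 0.3, 2: 0.1}
scores = {0: 9, 1: 3, 2: 0}
print(document_sustainability_score(weights, scores, top_k=2))
```

Limiting the average to the strongest topics keeps many weakly present topics from diluting the score; other aggregation rules would fit the slide's description equally well.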
  14. Weighting of document-level indexed terms
      • Document-level weights were computed for each semantic term using TF-IDF
      • TF-IDF is a measure of how important a word is to a document in a collection
      • The TF-IDF value increases proportionally to the number of times the word appears in a document (the ‘TF’ or term frequency), but is offset by how common the word is in a corpus (the ‘IDF’ or inverse document frequency)
      • The TF-IDF weighted terms are used to:
        • order the terms displayed for each document
        • boost document relevancy when index terms are used in discovery
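A minimal sketch of TF-IDF weighting over indexed terms, assuming the common variant with raw term counts for TF and log(N / document frequency) for IDF. The slides do not say which variant or library was used, and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute TF-IDF weights per document.

    docs: dict of doc id -> list of index terms assigned to that document
          (repeats allowed, one entry per occurrence).
    Returns a dict of doc id -> {term: weight}.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights


def ranked_terms(doc_weights):
    """Order a document's terms for display, highest weight first."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)


# Illustrative documents only.
docs = {
    "a": ["recycling", "recycling", "policy"],
    "b": ["policy", "forestry"],
}
weights = tf_idf_weights(docs)
print(ranked_terms(weights["a"]))
```

In this example "policy" occurs in every document, so its IDF is log(1) = 0 and it sorts below the more distinctive terms.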
  15. Auto-suggest and refining results
      [Thesaurus slide: a new thing, metadata we create, screenshot(s) of Sustainability Portal]
  16. Refinements in our use of the thesaurus and semantic index in sustainability
      • Auto calculation of sustainability score using LDA topics and thesaurus sustainability facet
        • Calculate topics and term correlations
        • Compute sustainability score for each topic based on the most relevant terms and sustainability facet
        • Compute a sustainability score for each corpus document based on topic weights and topic sustainability score
      • Automated LDA topic labeling
        • Labeling topics generated by unsupervised topic modeling is an ongoing challenge
        • We’re investigating the feasibility of using the same topic/term correlations used to compute sustainability scores to assign labels
        • Attempts to find the thesaurus term that best characterizes the most highly correlated terms for each topic
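One way the automated topic score might be computed is sketched below, assuming the score is the share of a topic's top-term correlation mass that falls inside the sustainability facet, scaled to 0-9. The labeling function is a naive stand-in that picks the single most correlated term, which is simpler than the approach under investigation. All names, parameters, and values are hypothetical.

```python
def topic_sustainability_score(term_correlations, sustainability_facet, top_n=10):
    """Score a topic 0-9 from its most relevant thesaurus terms.

    term_correlations:    dict of thesaurus term -> correlation with the topic.
    sustainability_facet: set of terms tagged with the sustainability facet.
    top_n:                number of most relevant terms to consider (assumed).
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(corr for _, corr in top)
    if total <= 0:
        return 0.0
    in_facet = sum(corr for term, corr in top if term in sustainability_facet)
    return 9.0 * in_facet / total


def naive_topic_label(term_correlations):
    """Naive stand-in for labeling: the single most highly correlated term."""
    return max(term_correlations, key=term_correlations.get)


# Illustrative values only.
correlations = {"recycling": 0.5, "waste management": 0.3, "sonnets": 0.2}
facet = {"recycling", "waste management"}
print(topic_sustainability_score(correlations, facet))
print(naive_topic_label(correlations))
```

Reusing the same topic/term correlations for both the score and the label mirrors the idea on the slide, though the actual method is not described in enough detail to reproduce.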
  17. Other JSTOR Labs projects/tools using the thesaurus and semantic index
      http://labs.jstor.org/jthes/
      http://labs.jstor.org/snap/
      http://labs.jstor.org/readings/
      Thesaurus Visualization Tool
  18. And some other JSTOR Labs projects
      http://labs.jstor.org/reflowit/
      http://labs.jstor.org/shakespeare/
  19. Thank you!
      Sharon.Garewal@ithaka.org
      Ronald.Snyder@ithaka.org