NASIG 2012 - Discovering the World's Research (ITHAKA portion)

Discovering the World’s Research

Ron Snyder
Director of Advanced Technology, ITHAKA/JSTOR
NASIG Annual Conference - 2012
June 9, 2012

Who we are

ITHAKA is a not-for-profit organization that helps the academic
community use digital technologies to preserve the scholarly
record and to advance research and teaching in sustainable ways.

We pursue this mission by providing innovative services that aid in
the adoption of these technologies and that create lasting impact.

JSTOR is a research platform that enables
discovery, access, and preservation of scholarly
content.

JSTOR Factoids

• Started in 1997
• Journals online: 1,604
• Articles online: 7.5 million
• Disciplines covered: 60
• Participating institutions: 7,800
• Countries with participating institutions: 167

JSTOR site activity

User Sessions (visits)
» New Sessions (per hour): 70k peak, 38k average
» Simultaneous Sessions: 44k peak, 21k average
Page Views
» 3.5M per day, 6.7M peak
Content Accesses
» 430k per day, 850k peak
Searches
» 456k per day, 1.13M peak

ITHAKA/JSTOR Discovery Initiatives

• Overhaul of JSTOR Search Infrastructure
  • Coming Soon (Summer 2012), watch for it…
• Analytics and data warehouse
  • Ingesting, organizing, and analyzing billions of usage
    events since JSTOR inception
• Improved external discoverability
  • Various SEO, Google/GS, MS-Academic projects
• Local Discovery Integration (LDI) Pilot
• Machine-based document classification

Local Discovery Integration Pilot
JSTOR and Summon

Problem Statement:

» Research has shown time and again that both students and faculty are
beginning their research at places other than the library OPAC, most
notably Google/Google Scholar and discipline-specific electronic
databases, and that the trend is continuing

Chart: Starting point for research, identified by faculty in 2003, 2006, and 2009 (2009 Faculty Study, ITHAKA)
Series: 2003, 2006, 2009. Vertical axis: 0% to 100%.
Categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource

Where is discovery happening?

Where JSTOR ‘sessions’ originated | Jan 2011 – Dec 2011

Problem Statement:

» As web-scale discovery services are being purchased and
implemented by institutions, the value of those implementations
is somewhat limited because they (for the most part) address
only the limited population of researchers who begin at
a library-designated starting point (e.g. OPAC)

Chart: JSTOR usage | Australia | Nov. 2010
Legend: JSTOR; Google/Google Scholar; Known Linking Partner; Library
Segment values as extracted (mapping to legend entries not recoverable): 16%, 6%, 9%, 76%, 2%

Research Behavior: Students

What is the easiest place to start research
according to students?

Chart: bar chart comparing Library Databases and Google (horizontal axis: 0 to 70)
Source: ProQuest survey of student research habits, 2007

Research Behavior: Faculty

Starting Point for Research, identified by faculty in 2003, 2006, and 2009

Series: 2003, 2006, 2009. Vertical axis: 0% to 100%.
Categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource
Source: ITHAKA 2009 Faculty Survey, 2010

Concept:

» If we can more effectively reach the users at the place(s) where
they normally begin their research, then we can begin to more
effectively build their awareness of the resources that the
institution has licensed/purchased for their purposes
» The local discovery integration (LDI) pilot study will attempt to
measure changes in the student/faculty research experience by
‘embedding’ the institution’s selected web-scale discovery
service in strategically-selected places in the JSTOR interface
where – we believe – the user would naturally want to ‘cast a
wider net’ for discovery

2010 JSTOR Usage Highlights

• Total Significant Accesses: 594,888,001
• Articles Downloaded: 74,901,344
• Articles Viewed: 112,751,906
• Searches Performed: 168,720,887
• Inbound Links from Licensed Partners: 13,013,904
• Inbound Links from Google/Scholar: 157,903,053
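
To put the two inbound-link figures above side by side, here is a small worked calculation. It uses only the numbers from the usage highlights; the variable names are illustrative, and the "share" is computed over these two sources only, since the slide does not say how inbound links relate to the other totals.

```python
# Figures copied from the 2010 JSTOR usage highlights.
inbound_from_partners = 13_013_904   # licensed linking partners
inbound_from_google = 157_903_053    # Google / Google Scholar

# How many Google/Scholar links arrive for each licensed-partner link.
ratio = inbound_from_google / inbound_from_partners

# Google/Scholar share of these two inbound sources combined
# (other referral sources are not listed on the slide).
google_share = inbound_from_google / (inbound_from_google + inbound_from_partners)

print(f"Google/Scholar links per licensed-partner link: {ratio:.1f}")
print(f"Google/Scholar share of the two sources: {google_share:.1%}")
```

On these figures, Google and Google Scholar deliver roughly twelve inbound links for every one that comes from a licensed partner, which is the gap the pilot is meant to address.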

How it works

Links Out

• Search Results
  • Advanced Search Page
  • Search Results View
• 3rd Page “Lightbox” pop-up
• Article View - Incoming from Google
• Article View - All other non-Google
• Zero Results Page

We placed links at various places along the research workflow in
JSTOR to allow students and researchers to “Cast a wider net”

Search results page

» JSTOR may not be the most appropriate starting place in every
instance, but it is a trusted and familiar interface. This will allow
the user to ‘flowback’ to another starting place (e.g. the library)

• Uses the familiar
university logo to grab
attention

• Inserts search terms into
link text to notify user of
customized behavior

• Positioned proximate to
search results; relevant
during the search result
evaluation phase
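
As a rough illustration of "inserts search terms into link text", the sketch below builds both the visible link text and the target URL from the user's current JSTOR query. The base URL, the `q` parameter, and the function name are hypothetical; a real discovery service defines its own search URL format, and this is not the pilot's actual implementation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and query parameter, for illustration only.
DISCOVERY_SEARCH_URL = "https://discovery.example.edu/search"


def build_discovery_link(search_terms, institution_name):
    """Return (link_text, url) for a 'cast a wider net' link."""
    # Collapse runs of whitespace so the link text reads cleanly.
    terms = " ".join(search_terms.split())
    # urlencode percent-encodes the terms and joins words with '+'.
    url = DISCOVERY_SEARCH_URL + "?" + urlencode({"q": terms})
    # Echo the terms in the link text so the user sees the link is customized.
    text = f'Search {institution_name} for "{terms}"'
    return text, url


text, url = build_discovery_link("  topic   models ", "Example University Library")
print(text)
print(url)
```

Repeating the user's own terms in the link text is the design point: it signals that the link will rerun the same search elsewhere rather than send the user to a generic landing page.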

Empty results page

» In this instance, the user has found nothing and the most typical web
response is to hit the ‘Back’ button. If we allow the user – at this point –
to execute a search in the local discovery interface, we might improve
the user experience
• One of the key places
where a user is likely to
want to try a different,
broader search

• Larger placement takes
advantage of available real
estate and cognitive space

• Users typically do not
spend time on this page, so
it is important that the link
be noticeable and
self-explanatory

Article page after Google search

» In 2010, over 32M Google/Google Scholar searches brought users
directly to an article page. They may or may not have found what they
really wanted, so we’d like to give them an alternative discovery choice

• Visible when coming
from a Google or Google
Scholar search

• Captures basic search
terms from the search

• Provides an opportunity
to convert a user from a
Google/Google Scholar
user to a Summon user
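
One way to "capture basic search terms from the search" is to read them from the referring URL when it carries a query string. The sketch below assumes the referrer is a Google or Google Scholar URL with the terms in a `q` parameter; whether a given referrer actually includes the query depends on the search engine and browser, and the function name is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer):
    """Return search terms from a Google/Google Scholar referrer, or None."""
    parsed = urlparse(referrer or "")
    host = parsed.hostname or ""
    # Assumption: only google.com and its subdomains (www, scholar) count.
    # Country-specific domains would need to be added to this check.
    if not (host == "google.com" or host.endswith(".google.com")):
        return None
    # parse_qs decodes '+' and percent-escapes and drops blank values.
    values = parse_qs(parsed.query).get("q")
    if not values or not values[0].strip():
        return None
    return values[0].strip()


print(search_terms_from_referrer(
    "https://scholar.google.com/scholar?q=latent+dirichlet+allocation&hl=en"
))
```

When the function returns `None`, the page can simply fall back to the generic link used for non-Google arrivals.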

Article page after JSTOR search

» In 2010, almost 113M articles were viewed in JSTOR. Again, they may
not have found what they really wanted, so we’d like to give them an
alternative discovery choice

• Visible when coming
from a JSTOR search

• Raises visibility of the
feature by exposing it to a
large number of users

• Inserts search terms
into link text to notify user
of customized behavior

Results View: All Pages

Link out from the
bottom of all pages of
the search results view.
This will allow more
opportunities to link out
for students/researchers
combing through large
sets of results.

Results View: 3rd Page

Pop-up on the third page of search results
Prompts the student/researcher to indicate whether they wish to link out through the LDI. This
will enable us to measure whether students wish to “cast a wider net” or not. In the other link
scenarios we don’t have a baseline for how many students do not notice the link vs. choose not
to use it.

Link out to Discovery Platform

Results Overview

» Highest usage occurred in Zero Results scenario

Data shown is for all institutions participating in Summon LDI
Date range: July 2011 – February 2012

Machine-Based Article Classifier
Assigning Articles to Disciplines

The Problem

JSTOR Corpus
• 60 disciplines
• 1,600 journals
• Nearly 8 million articles

• Disciplines are assigned at the journal level
• All articles in a journal inherit the journal’s assigned
disciplines
• With this approach, many articles have incomplete
and/or incorrect discipline tagging, which hinders discovery

• How to assign disciplines to articles?
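
The inheritance problem can be seen in a few lines of code. The journals, articles, and discipline tags below are invented purely for illustration and do not come from the JSTOR corpus.

```python
# Hypothetical journal-level discipline assignments (illustration only).
JOURNAL_DISCIPLINES = {
    "Journal A": ["Economics", "History"],
    "Journal B": ["Statistics"],
}

# Hypothetical articles; each one knows only which journal it appeared in.
ARTICLES = [
    {"title": "Grain prices in medieval markets", "journal": "Journal A"},
    {"title": "A sampling method for price surveys", "journal": "Journal A"},
]


def inherited_disciplines(article):
    """Return the disciplines an article inherits from its journal."""
    return list(JOURNAL_DISCIPLINES.get(article["journal"], []))


for article in ARTICLES:
    print(article["title"], "->", inherited_disciplines(article))
```

Both articles receive the same two tags, even though the second would arguably belong under Statistics as well. Journal-level tagging cannot express that difference, which is the gap article-level classification is meant to close.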

Topic Models

• Human classification and tagging is not feasible
• A machine-based classification process is desired

• Topic models are a way of finding structure in a set of
documents
• They allow us to find “latent” themes
• A topic model is not a topic map
• Some topic modeling approaches include
  • Latent Semantic Analysis (LSI/LSA) (Deerwester 1990)
  • Probabilistic LSA (Hofmann 1999)
  • Latent Dirichlet Allocation (LDA) (Blei 2003)

Topic Modeling – our approach

LDA – Latent Dirichlet Allocation
• A generative probabilistic model for analyzing
collections of documents
• A Bayesian model where each document is modeled
as a mixture of topics (disciplines)
• Models semantic relationships between documents
based on word co-occurrences

The Process

• We select the most representative documents from
each JSTOR discipline to build a topic model (from
the vocabulary of the document sample)
  • This sampling and vocabulary modeling is the most important part
    of the process!
  • We’re still experimenting with this, but find the citation network
    provides a good means for identifying core documents in a
    discipline
  • Also considering whether usage data might be leveraged here
• Each document in the corpus is then analyzed and
compared to the topic model to determine how well it
matches each topic
  • A probability distribution is generated providing discipline weights
  • The top weighted discipline(s) are associated with each article
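
The scoring step above can be sketched as follows. This is a simplified stand-in, not a full LDA implementation and not necessarily what JSTOR runs: it takes per-discipline word distributions as given (training them is not shown) and uses a few EM iterations to estimate one document's mixture weights over those fixed topics. The `TOPICS` table, the function names, and the 0.3 cut-off are all hypothetical.

```python
from collections import Counter

# Hypothetical word distributions per discipline, standing in for topics
# learned from representative documents. Every word has a non-zero
# probability in every topic, and each row sums to 1.
TOPICS = {
    "Economics": {"market": 0.40, "price": 0.30, "model": 0.20,
                  "species": 0.05, "habitat": 0.05},
    "Ecology": {"market": 0.05, "price": 0.05, "model": 0.20,
                "species": 0.40, "habitat": 0.30},
}


def discipline_weights(tokens, topics, iterations=50):
    """Estimate a document's mixture weights over fixed topics using EM."""
    names = list(topics)
    # Keep only words that every topic assigns a probability to.
    vocabulary = set.intersection(*(set(topics[k]) for k in names))
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    theta = {k: 1.0 / len(names) for k in names}  # start from uniform weights
    if total == 0:
        return theta  # no known words: nothing to update
    for _ in range(iterations):
        expected = dict.fromkeys(names, 0.0)
        for word, n in counts.items():
            # E-step: how responsible is each topic for this word?
            joint = {k: theta[k] * topics[k][word] for k in names}
            norm = sum(joint.values())
            for k in names:
                expected[k] += n * joint[k] / norm
        # M-step: new weights are the normalized expected word counts.
        theta = {k: expected[k] / total for k in names}
    return theta


def top_disciplines(weights, min_weight=0.3):
    """Return disciplines at or above min_weight, highest weight first."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, w in ranked if w >= min_weight]


tokens = "species habitat species model market the".split()
weights = discipline_weights(tokens, TOPICS)
print(weights)
print(top_disciplines(weights))
```

The output is a probability distribution over disciplines for the document. The cut-off decides whether an article receives one discipline or several, so an interdisciplinary article can carry more than one tag.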

Application

• On-site discovery
  • Will be a key element of our overhauled search
    infrastructure, tentatively scheduled for beta release mid-summer
• Use in article-level discipline/subject/topic mappings
for better integration with aggregated indexes
  • Will support a richer data feed for Summon, for instance
