Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

In the age of networked information, we've seen major changes to
expectations of how bibliographic data is searched and how it serves research.
Summon is a web-scale discovery service that indexes and provides
relevancy ranking across 1 billion items from thousands of collections and
makes them accessible to researchers from a single search box in 450
institutions in over 40 countries. JSTOR is a not-for-profit provider of high
quality scholarly content spanning more than 300 years and covering nearly
60 disciplines. JSTOR provides on-line access to nearly 1,600 journals for
more than 7,500 institutions in 166 countries. This presentation will discuss
similarities in the mission and differences in the scope of these two services,
including how they work together. We'll delve into the inner workings of each,
including treatment of data, analysis of search, and the challenges each service
faces in its mission.
Presenters: Laura Robinson, Serials Solutions and Ron Snyder, ITHAKA

  • So, how do we take a good idea (web-scale discovery) and make it better? How do we take the basic principle, which is good and valuable, and use it so that it achieves a broader impact?
  • Replace w data for all JSTOR (requested)
  • For several years, the anecdotal evidence that librarians had been witnessing first-hand was beginning to be verified by user studies published by OCLC and others, as well as student surveys reported on by ProQuest and others. The evidence was overwhelming: the gateway function that libraries had played for so long and valued so much, being THE gateway to academic research, was quickly being overtaken by web-based search engines like Google. It was one thing when undergraduates started to migrate away from the library …
  • A number of organizations had been following this trend closely, including my own (ITHAKA, the organizational umbrella under which JSTOR and Portico reside). We were taking a longitudinal look at faculty views about the library, and other pertinent scholarly communications issues, and comparing those views with similar survey data from librarians. One noticeable disconnect in these surveys, as you might imagine, was the perception of the “library as gateway”. Librarians believe it to be hugely important and faculty less so (science faculty much less so than humanities faculty). And students? Even less than that. Yet the dollars being spent on access services in libraries, both software and people, were (and continue to be) tremendous. Are those expenditures aligned properly with the expectations of the users, and if they are, then how do we more effectively leverage those investments to reach a broader audience?
  • Search Result Page: Design Notes
    -- Link is proximate to the first search result so that it is part of the evaluation workflow (e.g. user looks at first result, decides it is no good, sees the link)
    -- Uses branding element that the user is familiar with … should be something that all students / users will recognize
    -- Customized text in the link indicates to the user that this is session-specific and workflow-specific behavior
    -- Uses a link which is unobtrusive and not confusing, yet noticeable (vis-à-vis a search box, which we already have too many of)
    Requirements Used
    -- Uses 16x45px max logo
  • No results page: design notes
    -- Mimics the JSTOR search box above, clearly indicating behavior to the user
    -- Button size is near the maximum size it could be and still look like a clickable button
    -- Key place where users are likely to “cast a wider net” (exhausted all JSTOR search results)
    -- Takes advantage of larger real estate and simpler page design to drive users toward this feature
    Requirements Used
    -- Uses canonical name for button text (25 characters max)
    -- Uses larger logo, 250x50px max
  • Article Page – JSTOR search: Design notes
    -- Customized text in the link indicates to the user that this is session-specific and workflow-specific behavior
    -- Article page is already very stuffed, esp. with CSP/Publisher stuff, so we were forced to go with something more minimalist here
    -- Not necessarily a core workflow for users, searching from the article page, but gives us the opportunity to expose the feature to a wide audience
    -- Only difference from JSTOR article page is the missing “Back To Search Results” link
    Requirements Used
    -- Uses canonical name for link text (25 characters max)
  • Article Page – JSTOR search: Design notes
    -- Customized text in the link indicates to the user that this is session-specific and workflow-specific behavior
    -- Article page is already very stuffed, esp. with CSP/Publisher stuff, so we were forced to go with something more minimalist here
    -- Not necessarily a core workflow for users, searching from the article page, but gives us the opportunity to expose the feature to a wide audience
    Requirements Used
    -- Uses canonical name for link text (25 characters max)

Discovery and analysis of the world's research collections: JSTOR and Summon under the hood Presentation Transcript

  • 1. Discovering the World’s Research Ron Snyder Director of Advanced Technology, ITHAKA/JSTOR NASIG Annual Conference - 2012 June 9, 2012
  • 2. Who we are ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We pursue this mission by providing innovative services that aid in the adoption of these technologies and that create lasting impact. JSTOR is a research platform that enables discovery, access, and preservation of scholarly content.
  • 3. JSTOR Factoids • Started in 1997 • Journals online: 1,604 • Articles online: 7.5 million • Disciplines covered: 60 • Participating institutions: 7,800 • Countries with participating institutions: 167
  • 4. JSTOR site activity User Sessions (visits) » New Sessions (per hour): 70k peak, 38k average » Simultaneous Sessions: 44k peak, 21k average Page Views » 3.5M per day, 6.7M peak Content Accesses » 430k per day, 850k peak Searches » 456k per day, 1.13M peak
  • 5. ITHAKA/JSTOR Discovery Initiatives • Overhaul of JSTOR Search Infrastructure • Coming Soon (Summer 2012), watch for it… • Analytics and data warehouse • Capability for ingesting, organizing, and analyzing billions of usage events since JSTOR inception • Improved external discoverability • Various SEO, Google/GS, MS-Academic projects • Local Discovery Integration (LDI) Pilot • Machine-based document classification
  • 6. JSTOR Search Data Analysis Some Early Findings
  • 7. Site Search Activity, by type (6.3M sessions, 19.8M searches, from Mar-5 to Apr-16, 2012)
    Type       2009    2010    2011    2012
    Basic      68.8%   71.3%   77.4%   78.2%
    Advanced   30.5%   28.1%   22.3%   21.5%
    Locator     0.6%    0.6%    0.3%    0.2%
    Chart labels: Basic 78.20%, Advanced 21.49%, Locator 0.27%, Saved 0.05%
  • 8. Search Pages Viewed
    Pages viewed   Share of searches
    1              85.1%
    2               9.4%
    3               2.6%
    4               1.2%
    5+              1.7%
    1.3 Search Results Pages Viewed per Search
  • 9. Search Expression Profile: Average # of Terms Entered 3.9; Use of phrases 6.9%; Use of boolean expressions 7.0%; Use of fielded expressions 3.5%
  • 10. Searches Per Search Session. Average searches per session: Overall 3.1; Successful 3.8; Unsuccessful 2.1. [Chart: share of successful and unsuccessful sessions by number of searches performed (1, 2, 3, 4, 5+)] Persistence pays off…
  • 11. Click Thru Rates by Search Position. JSTOR, 20M searches from March 5 to Apr 16. [Chart: click-through rate (0% to 25%) by search result position (1 to 25)]
  • 12. Local Discovery Integration Pilot JSTOR and Summon
  • 13. Problem Statement: » Research has shown time and again that both students and faculty are beginning their research at places other than the library OPAC, most notably Google/Google Scholar and discipline-specific electronic databases, and that the trend is continuing [Chart: starting point for research, identified by faculty in 2003, 2006, and 2009 (2009 Faculty Study, ITHAKA); categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource]
  • 14. Where is discovery happening? Where JSTOR ‘sessions’ originated | Jan 2011 – Dec 2011
  • 15. Problem Statement: » As web-scale discovery services are being purchased and implemented by institutions, the value of those implementations is somewhat limited because they are (for the most part) only addressing that limited population of researchers who begin at a library-designated starting point (e.g. OPAC) [Chart: JSTOR usage, Australia, Nov. 2010; labels: JSTOR, Google/Google Scholar, Known Linking Partner, Library; values as extracted: 16%, 6%, 9%, 76%, 2%]
  • 16. Research Behavior: Students. What is the easiest place to start research according to students? [Chart: Library Databases vs. Google, scale 0 to 70] Source: ProQuest survey of student research habits, 2007
  • 17. Research Behavior: Faculty. Starting Point for Research, identified by faculty in 2003, 2006, and 2009 [Chart categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource] Source: ITHAKA 2009 Faculty Survey, 2010
  • 18. Concept: » If we can more effectively reach the users at the place(s) where they normally begin their research, then we can begin to more effectively build their awareness of the resources that the institution has licensed/purchased for their purposes » The local discovery integration (LDI) pilot study will attempt to measure changes in the student/faculty research experience by ‘embedding’ the institution’s selected web-scale discovery service in strategically-selected places in the JSTOR interface where, we believe, the user would naturally want to ‘cast a wider net’ for discovery
    2010 JSTOR Usage Highlights
    Total Significant Accesses              594,888,001
    Articles Downloaded                      74,901,344
    Articles Viewed                         112,751,906
    Searches Performed                      168,720,887
    Inbound Links from Licensed Partners     13,013,904
    Inbound Links from Google/Scholar       157,903,053
  • 19. How it works: Links Out • Search Results: Advanced Search Page, Search Results View • 3rd Page “Lightbox” pop-up • Article View - Incoming from Google • Article View - All other non-Google • Zero Results Page. We placed links at various places along the research workflow in JSTOR to allow students and researchers to “cast a wider net”
  • 20. Search results page » JSTOR may not be the most appropriate starting place in every instance, but it is a trusted and familiar interface. This will allow the user to ‘flowback’ to another starting place (e.g. the library) • Uses the familiar university logo to grab attention • Inserts search terms into link text to notify user of customized behavior • Positioned proximate to search results; relevant during the search result evaluation phase
  • 21. Empty results page » In this instance, the user has found nothing and the most typical web response is to hit the ‘Back’ button. If we allow the user, at this point, to execute a search in the local discovery interface, we might improve the user experience • One of the key places where a user is likely to want to try a different, broader search • Larger placement takes advantage of available real estate and cognitive space • Users typically do not spend time on this page so it is important to increase noticeability and self-explanation
  • 22. Article page after Google search » In 2010, over 32M Google/Google Scholar searches brought users directly to an article page. They may or may not have found what they really wanted, so we’d like to give them an alternative discovery choice • Visible when coming from a Google or Google Scholar search • Captures basic search terms from the search • Provides an opportunity to convert a user from a Google/Google Scholar user to a Summon user
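Slide 22 mentions capturing the search terms when a user arrives from Google or Google Scholar. The slides do not say how this was done; one plausible approach is to read the query from the HTTP referrer. The sketch below is only an illustration of that idea: the function name `google_search_terms`, the host check, and the reliance on a `q` query parameter are all assumptions, not details from the presentation.

```python
from urllib.parse import urlparse, parse_qs


def google_search_terms(referrer):
    """Return the search terms from a Google/Google Scholar referrer, or None.

    Hypothetical helper: assumes the referrer URL carries the query in a
    `q` parameter and that any host with a `google` label counts as Google.
    """
    parsed = urlparse(referrer)
    host = (parsed.hostname or "").lower()
    # Loose host check (assumption): matches www.google.com, scholar.google.com, etc.
    if "google" not in host.split("."):
        return None
    terms = parse_qs(parsed.query).get("q")
    return terms[0] if terms else None


# Example referrers (made up for illustration).
print(google_search_terms("http://scholar.google.com/scholar?q=serials+crisis&hl=en"))
print(google_search_terms("http://www.jstor.org/action/doBasicSearch?Query=serials"))
```

The captured terms could then be inserted into the link text, as the slide describes, so the user sees that the link-out search is tailored to the current session.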
  • 23. Article page after JSTOR search » In 2010, almost 113M articles were viewed in JSTOR. Again, they may not have found what they really wanted, so we’d like to give them an alternative discovery choice • Visible when coming from a JSTOR search • Raises visibility of the feature by exposing it to a large number of users • Inserts search terms into link text to notify user of customized behavior
  • 24. Results View: All Pages. Link out from the bottom of all pages of the search results view. This will allow more opportunities to link out for students/researchers combing through large sets of results.
  • 25. Results View: 3rd Page. Pop-up on the third page of search results. Prompts the student/researcher to indicate whether they wish to link out through the LDI. This will enable us to measure whether students wish to “cast a wider net” or not. In the other link scenarios we don’t have a baseline of how many students do not notice the link vs. choose not to use it.
  • 26. Link out to Discovery Platform
  • 27. What we intend to track:
    1. To which Summon domain JSTOR is sending the user (usyd, ncsu, asu, etc.)
    2. Where the local discovery request originated within JSTOR (search results page, null results page, article view page)
       • outpage=searchresults
       • outpage=noresults
       • outpage=pageview
    3. If the user is identified as having come from Google or Google Scholar and clicks on the “pageview” link, we will capture that information
    4. We will be providing the following identifiers to Summon to allow tracking on their end:
       • origin=JSTORpagesummon
       • origin=JSTORsearchsummon
       • origin=JSTORnoresultssummon
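The tracking plan on slide 27 can be sketched as a small link builder. The `outpage` and `origin` values come from the slide; everything else is an assumption made for illustration, including the pairing of each `outpage` with an `origin`, the `q` parameter, the `example.org` host pattern, and the function name `build_link_out`. This is not the pilot's actual implementation.

```python
from urllib.parse import urlencode

# Identifiers listed on the slide. Pairing each outpage with an origin
# value is an assumption based on the similar names.
ORIGIN_FOR_OUTPAGE = {
    "searchresults": "JSTORsearchsummon",
    "noresults": "JSTORnoresultssummon",
    "pageview": "JSTORpagesummon",
}


def build_link_out(summon_domain, outpage, search_terms, from_google=False):
    """Return (target_url, tracking_event) for one link-out click.

    Hypothetical: the host pattern and the `q` parameter are placeholders,
    not the real Summon URL scheme.
    """
    if outpage not in ORIGIN_FOR_OUTPAGE:
        raise ValueError(f"unknown outpage: {outpage}")
    params = {"q": search_terms, "origin": ORIGIN_FOR_OUTPAGE[outpage]}
    target_url = f"https://{summon_domain}.example.org/search?{urlencode(params)}"
    # What JSTOR would log on its side: the Summon domain, the originating
    # page, and whether a pageview click followed a Google arrival.
    tracking_event = {
        "summon_domain": summon_domain,
        "outpage": outpage,
        "from_google": bool(from_google) and outpage == "pageview",
    }
    return target_url, tracking_event


url, event = build_link_out("ncsu", "noresults", "serials crisis")
print(url)
print(event)
```

Keeping the JSTOR-side event separate from the `origin` value passed in the URL mirrors the slide's split between what JSTOR tracks and what Summon can track on its end.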
  • 28. Some Preliminary Results » Highest usage occurred in Zero Results scenario. Data shown is for all institutions participating in Summon LDI. Date range: July 2011 to February 2012
  • 29. Machine-Based Article Classifier Assigning Articles to Disciplines
  • 30. The Problem JSTOR Corpus • 60 disciplines • 1,600 journals • Nearly 8 million articles • Disciplines are associated at the Journal level • All articles in a Journal inherit the Journal assigned disciplines • Using this approach many articles have incomplete and/or incorrect discipline tagging hindering discovery • How to assign disciplines to articles?
  • 31. Topic Models • Human classification and tagging is not feasible • A machine-based classification process is desired • Topic models are a way of finding structure in a set of documents • They allow us to find “latent” themes • A topic model is not a topic map • Some topic modeling approaches include • Latent Semantic Analysis (LSI/LSA) (Deerwester 1990) • Probabilistic LSA (Hofmann 1999) • Latent Dirichlet Allocation (LDA) (Blei 2003)
  • 32. Topic Modeling – our approach LDA – Latent Dirichlet Allocation • A generative probabilistic model for analyzing collections of documents • A Bayesian model where each document is modeled as a mixture of topics (disciplines) • Models semantic relationships between documents based on word co-occurrences
  • 33. The Process • We select the most representative documents from each JSTOR discipline to build a topic model (from the vocabulary of the document sample) • This sampling and vocabulary modeling is the most important part of the process! • We’re still experimenting with this, but find the citation network provides a good means for identifying core documents in a discipline • Also considering whether usage data might be leveraged here • Each document in the corpus is then analyzed and compared to the topic model to determine how well it matches each topic • A probability distribution is generated providing discipline weights • The top weighted discipline(s) are associated with each article
  • 34. Application • On-site discovery • Will be a key element of our overhauled search infrastructure, tentatively scheduled for beta release mid-summer • Use in article-level discipline/subject/topic mappings for better integration with aggregated indexes • Will support a richer data feed for Summon, for instance
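The process on slide 33 (build per-discipline vocabularies from representative documents, then score every article against them to get discipline weights) can be illustrated with a deliberately simplified stand-in. This is not LDA and not JSTOR's classifier: it fixes one smoothed word distribution per discipline from seed documents and then uses a few EM iterations to estimate a single document's mixture weights. The seed texts, discipline names, and function names are all invented for the example.

```python
from collections import Counter


def word_distribution(docs, vocab, smoothing=1.0):
    """Smoothed unigram distribution over `vocab` for one discipline."""
    counts = Counter(w for doc in docs for w in doc.lower().split() if w in vocab)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def discipline_weights(doc, topics, iterations=50):
    """Estimate mixture weights over disciplines for one document via EM.

    `topics` maps discipline name -> word distribution (all over the same
    vocabulary). The word distributions stay fixed; only weights are fit.
    """
    names = list(topics)
    vocab = set(topics[names[0]])
    words = [w for w in doc.lower().split() if w in vocab]
    weights = {n: 1.0 / len(names) for n in names}
    if not words:
        return weights  # no known words: fall back to uniform weights
    for _ in range(iterations):
        totals = dict.fromkeys(names, 0.0)
        for w in words:
            # E-step: responsibility of each discipline for this word.
            joint = {n: weights[n] * topics[n][w] for n in names}
            norm = sum(joint.values())
            for n in names:
                totals[n] += joint[n] / norm
        # M-step: new weights are the average responsibilities.
        weights = {n: totals[n] / len(words) for n in names}
    return weights


# Invented seed documents standing in for "most representative" articles.
seeds = {
    "economics": ["market price inflation trade", "labor market wage price"],
    "botany": ["plant leaf species growth", "seed plant flower species"],
}
vocab = {w for docs in seeds.values() for d in docs for w in d.split()}
topics = {name: word_distribution(docs, vocab) for name, docs in seeds.items()}

weights = discipline_weights("wage growth and market price inflation", topics)
print(max(weights, key=weights.get), weights)
```

In a real pipeline the seed selection (for example via the citation network, as the slide suggests) matters far more than this scoring step, and an actual LDA implementation would learn the topic-word distributions jointly rather than fixing them in advance.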