2010 CSHE paper “Assessing the future landscape of scholarly communications”, Harley, Diane et al.
Comments: Nature 464, 466 (25 March 2010) http://www.nature.com/nature/journal/v464/n7288/full/464466a.html
A sample of what a very simple text analysis API can look like: plot the occurrence of ‘Malvinas’ over time. The subtlety is the linear decrease in interest after the post-war spike.
Trying to make it very clear that datasets are a different and more central element of all scholarly research (with the possible exception of maths, philosophy and religion). Data both inspires and confirms ideas; text is mostly informative, rarely inspirational.
Highlights are the discussion points.
ii. In the physical sciences the half-life of content access frequency is ~6-8 years.
Grey text is a PQ business value, not a user value.
Whilst content can be obfuscated or reduced, there are thorny issues with usage data. Early policy decisions need to be taken with respect to exposing usage data, even indirectly (triangulation is always possible).
1. Oratio has shifted from ‘speech’ to ‘prayer’ and back again in the Latin literature. See Greg Crane et al.
Note that the number of articles is small anyway, so the data could simply be random variation. This is way too simple a tool for serious analysis.
Figures on faculty demographics from http://nces.ed.gov/programs/digest/d09
Sources in earlier paper on datasets.
Trends Influencing Future Scholarship
May 31, 2012
Technology is changing how we structure work
CD, DVD, iTunes
Spotify, YouTube (streamed)
Towards reproducible research
context, quality, trust
means easy access to the
Emerging trend of journals and publishers linking to open-access data repositories
Journals and funding agencies setting policy to preserve
and associate data supporting research results
Changing Global Research Patterns
The center of gravity of the world system of scholarship
is moving from west to east.
Citations of U.S. research articles in non-U.S.
literature, by region/country: 1998–2010
Asia-8 = India, Indonesia, Malaysia, Philippines, Singapore, South
Korea, Taiwan, Thailand; EU = European Union
Share of citations to
international literature: 2000–10
Citations from Asia 10 Articles
NOTES: Asia-10 includes China, Japan, India, Indonesia, Malaysia, Philippines,
Singapore, South Korea, Taiwan, and Thailand. Asia-8 excludes China and Japan.
SOURCE: National Science Board, Science and Engineering Indicators 2010
US Academic Expenditures on Research by
(Millions of current dollars)
SOURCE: National Science Foundation/Division of Science Resources Statistics, Survey of Research and Development Expenditures at Universities and Colleges: FY 2008.
Master’s degrees conferred
Humanities Other fields
SOURCE: Survey of
Academic work is social
2006 Univ. of Minn.
68% - Faculty work
52% - Collaborate with
colleagues at other
46% - Find the distance
from colleagues is a
“One group of experts
can’t do everything”
SOURCE: Newman M E J PNAS 2001;98:404-409
Low threshold (at good
High threshold (competitive)
Molecular and cell
Junior faculty in all fields are especially
cautious for fear of theft and/or
Changing Environment of Research
Book & Journal
The “center” of research is shifting from libraries to other sources
Changing Resource Expenditures
SOURCE: NCES 2010, 2008, 2006, 2004, 2002 Supplemental tables for Academic Libraries
Books, serial backfiles and other
Overall % by content category
Books, serial backfiles
and other materials
Current Serial Subscriptions
Overall % of purchased and
Changing Discovery Methods
Data from Evans (2008
Activities are the
context for when our
content is used in
“whole is greater than
Critical for ecosystem
Research Network Connections
Era of connections
The Interconnected Article
Has a content edge over
print: more of it and more
Evolution of search
We are here
Going from metadata about objects and text search to
ideas, context and mining
Greater granularity of discovery
Structural analysis of content
Internationalization of search
Semantic Search utilizes robust data structures like
ontologies to apply domain knowledge to otherwise two-dimensional terms.
The application of word context provides a dynamic
aspect to semantic search, allowing the user’s real-time
intent to guide results.
Contrast with static thesauri and controlled vocabularies
which miss nuances of context and intent.
Automated processing of library
PubMed contains ~17,787,763 articles to
Manually searching is tedious and
Can be hard finding links between data
Conclusion? Machines will be reading the
Workflows, researcher Paul Fisher found
Link between cholesterol, patient trauma
and parasite resistance in cattle.
Arrowsmith LBD: the ABC Model
Articles about an AB relationship
dietary fish oil
Articles about a BC relationship
AB and BC are complementary but disjoint: they can reveal an implicit
relationship between A and C in the absence of any explicit relation.
The researcher assesses titles in the B literature identified by the system
for fit or contribution to problem.
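The ABC model above can be sketched in a few lines. This is a minimal illustration, not Arrowsmith's actual implementation: the function name `abc_candidates`, the article ids and the title terms are all hypothetical toy data, and the C term (Raynaud's disease) is assumed from Swanson's well-known fish oil example rather than taken from the slide.

```python
from collections import defaultdict

def abc_candidates(a_term, c_term, articles):
    """Return B-terms that link A and C through disjoint literatures.

    `articles` maps an article id to the set of terms found in its title.
    """
    a_links = defaultdict(set)  # B-term -> ids of articles mentioning A and B
    c_links = defaultdict(set)  # B-term -> ids of articles mentioning B and C
    for art_id, terms in articles.items():
        if a_term in terms and c_term in terms:
            continue  # an explicit A-C article is not an implicit link
        if a_term in terms:
            for b in terms - {a_term}:
                a_links[b].add(art_id)
        if c_term in terms:
            for b in terms - {c_term}:
                c_links[b].add(art_id)
    shared = set(a_links) & set(c_links)
    # Rank by total number of supporting articles, then alphabetically.
    return sorted(shared, key=lambda b: (-(len(a_links[b]) + len(c_links[b])), b))

# Hypothetical toy titles, reduced to term sets (assumption, not real data).
articles = {
    "t1": {"dietary fish oil", "blood viscosity"},
    "t2": {"dietary fish oil", "platelet aggregation"},
    "t3": {"blood viscosity", "raynaud's disease"},
    "t4": {"platelet aggregation", "raynaud's disease"},
    "t5": {"blood viscosity", "raynaud's disease", "cold exposure"},
    "t6": {"dietary fish oil", "salmon"},
}
candidates = abc_candidates("dietary fish oil", "raynaud's disease", articles)
print(candidates)
```

The ranked B-terms are only candidates; as the slide says, the researcher still has to assess the underlying titles for fit.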
Content Presented as Data
Incidence of “Malvinas”
Something happened!
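A term-incidence plot of this kind could be produced along the following lines. This is a rough sketch under stated assumptions: `incidence_by_year` is a hypothetical helper, the six documents are invented toy data, and the share is normalized by the number of documents per year because raw counts on a small corpus may be little more than random variation.

```python
import re
from collections import Counter

def incidence_by_year(term, documents):
    """Return {year: share of that year's documents mentioning `term`}."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    totals, hits = Counter(), Counter()
    for year, text in documents:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Invented (year, text) pairs standing in for a real article corpus.
documents = [
    (1981, "Trade talks continue in Buenos Aires."),
    (1982, "Fighting reported on the Malvinas."),
    (1982, "Malvinas conflict dominates the news."),
    (1983, "Aftermath of the Malvinas war."),
    (1983, "Elections scheduled for October."),
    (1984, "Economic reform debated."),
]
series = incidence_by_year("Malvinas", documents)
for year, share in series.items():
    # Crude text bar chart: one '#' per 10% of documents.
    print(year, "#" * round(share * 10), f"{share:.2f}")
```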
Purpose of Content Analytics
inspires or proves ideas
The true center of
They occupy different points in the
scholarly information lifecycle.
What does Content Analytics do?
Collaborative man-machine exploration
highlights trends, clues or anomalies [visualization – leverages
On demand Analysis.
identify and quantify trends, relationships, concepts and
correlations. (tools: SEASR, NLTK, Autonomy, …)
generate new ‘facets’ or annotations for discovery [augments
Older content is read less (ii), but remains important for trend
analysis and statistical significance. [value shifts]
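The note on older content gives a half-life of roughly 6-8 years for access frequency in the physical sciences. As a back-of-envelope sketch, and assuming access decays exponentially (the notes only state the half-life, not the shape of the curve), the share of initial access left at a given age could be estimated like this; `remaining_access_share` is a hypothetical helper name.

```python
def remaining_access_share(age_years, half_life_years):
    """Fraction of initial access frequency left after `age_years`.

    Assumes simple exponential decay, which is an assumption of this sketch.
    """
    return 0.5 ** (age_years / half_life_years)

# Compare both ends of the quoted 6-8 year range.
for half_life in (6, 8):
    shares = {age: remaining_access_share(age, half_life) for age in (5, 10, 20)}
    print(half_life, {age: round(share, 3) for age, share in shares.items()})
```

Even when the access share is small, the older material still counts toward trend analysis, which is the value shift the slide points to.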
Examples of text as data
Changes in word sense (e.g. consumption (TB), moot, oratio (1)) and spelling (e.g. 18th C. ſ to s, *re
Bibliometrics and other usage analyses
Institution vs. discipline
Pharma: Drug / Symptom correlation.
Biology: Species / date / location observations.
Social Sci: Work/life habits of undergrads based on
access patterns at different institutions [ usage data based]
Unstructured text to queryable data structures
TOO MUCH TEXT TO HAND ANALYZE.
Improved discovery ( better ‘metadata’ )
e.g. content stats -> content acquisitions
E.g. Distribution of authors vs. disciplines vs. grants
End User research agendas
High-End : Custom (user specified) mining as a service
Simple : Visualization of results ( frequency / co-occurrence … )
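The "Simple" option above (frequency / co-occurrence) can be illustrated with a small counting sketch. The term lists are hypothetical, standing in for terms already extracted from documents, and `cooccurrence` is an illustrative helper rather than part of any named tool.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_terms):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for terms in docs_terms:
        # Deduplicate and sort so each pair is counted once per document.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

# Hypothetical extracted terms, one list per document.
docs_terms = [
    ["aspirin", "headache", "nausea"],
    ["aspirin", "headache"],
    ["ibuprofen", "nausea"],
]
pairs = cooccurrence(docs_terms)
print(pairs.most_common(3))
```

The resulting pair counts are already a queryable data structure: they can feed a co-occurrence chart or be filtered for, say, drug / symptom pairs.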
Options for use of text mining
Many, many options – it is about capability
disambiguation of people, places and events
concept labeling ( different terms, same ‘thing’)
E.g. extract institutions from theses (demographics).
E.g. taxon labeling in biology.
Discovery tools , e.g.
topic labeling (`natural` disciplines)
reading level (grade, UG, PG … )
Structural analysis ( important parts of doc )
Boilerplate vs. ‘meat’
E.g. “White Plague”
Heroin, TB
Datasets: Factoids & point data
ca. 1.4M Faculty ( 50% full-time ) in US HE, ~75M people enrolled in US HE
ca. 100k Faculty in UK HE
44% of Researchers use online (other people’s) datasets for their research
48% of Researchers use datasets > 1GB
10.8% store their data outside their institution ( 50% store it in their “lab”)
1 - 5% of datasets are formally moved into the curation process.
66% of faculty have requested other people’s data ( and 49% of those got it).
26.5% have the expertise to analyze their own data.
80.3% do not have sufficient expertise to manage their own data
Institutional storage costs ~ $600 / TB / year
58% is the annual increase in the amount of data being generated
20-40% is annual growth in the amount of storage deployed (est.)
< 1% of ecological data is accessible after publication.
> 85% of all information is in text form
2.7 times more citations accrue to papers with accessible data
3 to 6 times more papers emerge if the data is accessible.
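Two of the figures above, 58% annual growth in data generated against an estimated 20-40% annual growth in storage deployed, imply a widening gap. A quick sketch of the compounding, assuming both rates stay constant and both quantities start from the same baseline (neither assumption comes from the source); `growth_gap` is a hypothetical helper.

```python
def growth_gap(data_growth, storage_growth, years):
    """Ratio of data volume to storage capacity after `years`.

    Both quantities are normalized to 1.0 at year zero (an assumption).
    """
    return ((1 + data_growth) / (1 + storage_growth)) ** years

# Low and high ends of the estimated storage growth range.
for storage_growth in (0.20, 0.40):
    ratio = growth_gap(0.58, storage_growth, years=5)
    print(f"storage +{storage_growth:.0%}/yr: data/storage ratio after 5 years = {ratio:.2f}")
```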
Drivers of change
Nearly ubiquitous high-speed wireless globally
Global technology innovation
Policy shifts in Academia
Internationalization of scholarship
Growth in primary source datasets
Fearless and connected entrepreneurs
Fearless and connected researchers