This presentation was provided by Corey Harper of Elsevier Labs during the NISO webinar, Using Analytics to Extract Value from the Library's Data, held on September 12, 2018.
Harper Analytics Beyond Usage Numbers
1. Analytics Beyond
Usage Numbers
Presented for NISO, September 12, 2018
Corey A Harper (@chrpr)
Applying analytics to metadata,
content, and research
Many thanks for contributions from Brad Allen, Jessica Cox,
Ron Daniel, Helena Deus, Paul Groth, Darin McBeath, and
Tony Scerri
2. Analytics Beyond Usage Numbers
• Introduction to analytics
• Metadata analytics – a DPLA case study
- Metadata visualization
- Metadata completeness & effects on usage
• Information analytics
- “A global information analytics company”
- To support linked data & knowledge graphs
- Examples
• Tools, recommended practice, and conclusion
6. Quantifying metadata – a case study
• “Record completeness”
• Field distributions and statistics
• Usage data and query language
• Natural language processing
- Query language
- Record full text
- Field full text
• Term and bi-gram frequency
• Topic modeling
• Metadata impact on usage
https://journal.code4lib.org/articles/11752
Caveat: data and graphics are from 2016
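Two of the measures listed on this slide, record completeness and bi-gram frequency, can be sketched in a few lines of Python. This is a minimal illustration only: the sample records, the field names, and the choice of "non-empty" as the completeness test are assumptions made here, not the method used in the linked Code4Lib article.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative, not the DPLA schema.
records = [
    {"title": "Harbor at dawn", "subject": ["Harbors", "Boats"], "description": "A view of the harbor"},
    {"title": "Main Street", "subject": [], "description": ""},
    {"title": "Portrait of a farmer", "subject": ["Farmers"], "description": None},
]
FIELDS = ["title", "subject", "description"]


def completeness(record, fields=FIELDS):
    """Fraction of the chosen fields that hold a non-empty value."""
    filled = sum(1 for field in fields if record.get(field))
    return filled / len(fields)


def bigram_counts(texts):
    """Count adjacent word pairs across lowercased, whitespace-split texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


scores = [completeness(r) for r in records]
title_bigrams = bigram_counts(r["title"] for r in records)
```

A real analysis would likely weight fields differently and use a proper tokenizer; the equal weighting and whitespace split here are the simplest possible choices.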
7. Average # of subjects by provider
8. Percentage of records with subject
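The two per-provider measures named on slides 7 and 8 (average number of subjects, and percentage of records with at least one subject) could be computed roughly as follows. The provider names and records are made up for illustration, and the function name `subject_stats` is hypothetical.

```python
from collections import defaultdict

# Hypothetical (provider, subjects) pairs; the provider names are invented.
records = [
    ("Provider A", ["Maps", "Rivers"]),
    ("Provider A", []),
    ("Provider B", ["Portraits"]),
    ("Provider B", ["Letters", "Diaries", "Travel"]),
]


def subject_stats(rows):
    """Return {provider: (average subject count, percent of records with a subject)}."""
    sizes_by_provider = defaultdict(list)
    for provider, subjects in rows:
        sizes_by_provider[provider].append(len(subjects))
    stats = {}
    for provider, sizes in sizes_by_provider.items():
        average = sum(sizes) / len(sizes)
        percent_with = 100 * sum(1 for n in sizes if n > 0) / len(sizes)
        stats[provider] = (average, percent_with)
    return stats


stats = subject_stats(records)
```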
21. “A global information analytics business”
22. To help customers answer questions at point of need:
Elsevier combines content with technology
to provide actionable knowledge
Content
• Chemistry database: 500m published experimental facts
• User queries: 13m monthly users on ScienceDirect
• Books: 35,000 published books
• Drug database: 100% of drug information from pharmaceutical companies, updated daily
• Research: 16% of the world’s research data and articles published by Elsevier
Technology
• 1,000 technologists employed by Elsevier
• Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
• Machine reading: 475m facts extracted from ScienceDirect
• Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
• Semantic enhancement: knowledge on 50m chemicals captured as 11B facts
Operational Excellence
23. Elsevier Labs
• Reports into Architecture / Technology
• Mix of Researchers and very experienced software
developers
• Two main modes of work
- Targeted Research, primarily stuff that’s still 2-3 years out
- Accelerated Development, in partnership with and
support of product groups
• Applying state of the art research to medical and
scientific domain
24. Differences in citation language
Introduction:
Researchers have successfully reprogrammed somatic cells into stem-like cells – known as induced pluripotent stem cells (iPSCs) – which share many of the characteristics of ESCs (Takahashi and Yamanaka, 2006).
Materials and Methods:
Human nephron progenitors were induced from iPSCs (201B7) (Takahashi and Yamanaka, 2006), based on the protocol that we previously established (Taguchi et al., 2014).
25. Citation language pre- & post- Nobel Prize
https://openaccess.leidenuniv.nl/handle/1887/65351
26. AnnotationQuery
This allows us to search for:
• a <sentence>
• in the <methods_section>
• that contains
• a citation to
• a Nobel Prize paper
(“Nobel_papers.txt”)
https://github.com/elsevierlabs-os/AnnotationQuery
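The query described on this slide is, at heart, a containment test over stand-off annotations: find sentences that lie inside a methods section and that contain a citation to a paper on a given list. The sketch below shows that idea in plain Python. It does not use the AnnotationQuery library or its API; the `Annotation` class, the offsets, and the paper identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str      # e.g. "sentence", "methods_section", "citation"
    start: int     # character offset where the span begins
    end: int       # character offset where the span ends (exclusive)
    ref: str = ""  # identifier of the cited paper (citations only)


def contained_in(inner, outer):
    """True when the inner span lies entirely within the outer span."""
    return outer.start <= inner.start and inner.end <= outer.end


def sentences_citing(annotations, section_kind, paper_ids):
    """Sentences inside a section of the given kind that cite one of paper_ids."""
    sections = [a for a in annotations if a.kind == section_kind]
    citations = [a for a in annotations if a.kind == "citation" and a.ref in paper_ids]
    hits = []
    for sentence in annotations:
        if sentence.kind != "sentence":
            continue
        in_section = any(contained_in(sentence, sec) for sec in sections)
        has_citation = any(contained_in(cit, sentence) for cit in citations)
        if in_section and has_citation:
            hits.append(sentence)
    return hits


# Hypothetical annotations for one article; offsets and identifiers are invented.
annotations = [
    Annotation("methods_section", 100, 300),
    Annotation("sentence", 0, 80),
    Annotation("citation", 50, 78, "takahashi2006"),
    Annotation("sentence", 100, 180),
    Annotation("citation", 150, 178, "takahashi2006"),
    Annotation("sentence", 181, 260),
    Annotation("citation", 230, 258, "taguchi2014"),
]
nobel_papers = {"takahashi2006"}  # stands in for the contents of Nobel_papers.txt
hits = sentences_citing(annotations, "methods_section", nobel_papers)
```

Only the second sentence qualifies: the first cites the listed paper but sits outside the methods section, and the third sits inside it but cites a different paper.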
27. Building blocks for text analytics
• Original Markup
Annotations
• Part of Speech Annotations
• Sentences, Paragraphs,
Noun Phrases, Verb
Phrases
• Dependency and
Constituency Parse Trees
• Query Across Annotation
Sets!
28. Additional use cases
Units and Measurements
• Find a <numeric pattern>
12 ± 3, 53–55, 0.245
• Followed by a <unit of
measurement>
°C, μM, hours, h, MPa,
Temperatures
• Find a <U&M>
• Of type <temperature>
• With the word
<housing>
• in the same <sentence>
• in the <methods_section>
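The first use case above, a numeric pattern followed by a unit of measurement, can be approximated with a regular expression. This is a rough sketch rather than the annotation-based approach the slides describe: the pattern covers only the number forms and units shown on the slide, and the sample sentence is invented.

```python
import re

# Number forms from the slide: a plain value, a range (53–55), or value ± error.
NUMBER = r"\d+(?:\.\d+)?"
NUMERIC = rf"{NUMBER}(?:\s*(?:±|–|-)\s*{NUMBER})?"
# Units listed on the slide; "hours" is placed before "h" so the longer unit wins.
UNIT = r"(?:°C|μM|hours|MPa|h)"
# The trailing lookahead stops a unit from matching the start of a longer word.
PATTERN = re.compile(rf"(?P<value>{NUMERIC})\s*(?P<unit>{UNIT})(?![A-Za-z])")


def find_measurements(text):
    """Return (value, unit) pairs for each numeric pattern followed by a unit."""
    return [(m.group("value"), m.group("unit")) for m in PATTERN.finditer(text)]


# Invented sample sentence for illustration.
sample = "Mice were housed at 22 ± 2 °C for 48 hours; samples held 53–55 MPa."
found = find_measurements(sample)
```

The second use case (a temperature in the same methods-section sentence as a housing term) would additionally need sentence and section boundaries, which a bare regular expression does not provide.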
29. Units and measurements
• Nanoamperes (nA) for neural cell Rheobase values
• Megapascals (MPa) for compressive strength of concrete
• Milligrams per Kilogram (mg/kg) for administered drug
dosages
34. Information analytics enables:
• Datasets aggregated across the literature
• Knowledge Graphs for specific domains
• Databases of experimental results
• Decision support and question answering systems
This kind of information extraction can be a key
component of realizing the library community’s vision for
linked data in cultural heritage & scholarly research.
35. A tour of analytics
• From library analytics and metrics,
• To metadata analytics,
• To information analytics and knowledge graph
extraction
• Heterogeneous data streams:
- Combined in interesting ways,
- Made queryable and recombinant,
- For use in question answering, visualization, and more.
36.
• Heterogeneous storage
- Databases
- Graphs
- Columnar data formats
- Cloud object storage
• Heterogeneous tools and systems
- Spark and Kafka
- Tableau, D3, Seaborn
- Notebooks (Jupyter, Databricks)
Design guidance
https://dataintensive.net/
37. Thank you
Corey A Harper
Sr. Technology Researcher
Elsevier Labs
c.harper@elsevier.com
@chrpr
41. Library analytics
• Data-informed decision making
• Use cases around:
- Library instruction
- Personalized recommendations
- E-resource cost per use
- Physical collections & space
- Digital collections
• Tying library programs to student
GPA
• Building personas from data
• Service point staffing and use
All of this requires
• Data collection and integration:
- University data warehouses
- Library systems
- Subscription vendors
• Data management policies
• Data analysis tools and expertise
44. Answers are about things, not just Works
Why shouldn’t a search on an author
return information about the author,
including the author’s works? Where
was the author born, when did she
live, what is she known for? … All of
this is possible, but only if we can
make some fundamental changes in
our approach to bibliographic
description. ... The challenge for us
lies in transforming what we can of our
data into interrelated “things” without
overindulging that metaphor.
Coyle, K. (2016). FRBR, before and after: a look at our
bibliographical models. Chicago: ALA Editions.
45. Building Knowledge Graphs
Plus LAWDI, LOD-LAM, LD4L-Labs, & Many More
https://zepheira.com/ – https://linkedjazz.org/network/ – http://vivo.cornell.edu/
Script:
We’ve been working on combining our vast quantities of structured data with technology, supported by our operational expertise.
You may know us for content from articles and books, but for example, we hold 500m experimental facts in our chemistry databases.
We collect 13M user queries on ScienceDirect every month. Elsevier is sitting atop a trove of “big data”.
And we’ve built the technology muscle to process that data. We employ over 1000 technologists.
We’re using artificial intelligence such as machine reading and machine learning. As an example of scale on ScienceDirect, we employ collaborative filtering, analyzing 1bn articles from 2.5m researchers daily, to generate 250m article recommendations.