A publisher would care about open data for several reasons:
1) Open data increases the value of all parts of the web by allowing programs, not just people, to utilize the data through interconnecting and joining it.
2) Publishers are evolving from linear supply chains focused on content delivery to users, to becoming marketplaces that optimize the number of interactions between users through networked open science.
3) The future of publishing involves networked open science where data is openly accessible, annotated with metadata, and linked together in research objects, increasing findability, accessibility, interoperability, and reusability of research outputs.
What role can publishers play in the open data ecosystem? – Varsha Khodiyar
Presentation at session 3 of the NIH workshop 'Role of Generalist Repositories to Enhance Data Discoverability and Reuse' on Feb 11th, at the NIH Main Campus.
Keynote presentation at 2020 NIH/NLM workshop on generalist repositories. Central themes include software as a richer pathway to data than articles, the development of new metrics for software (such as the CHAOSS framework), working with the technology companies through organizations like the Eclipse Foundation, and the importance of linked data. In particular, the concept of the "value line" as a means to map generalist repositories represents an important opportunity.
An update on the latest BioSharing work; including work with ELIXIR and NIH BD2K, also our survey to assess user needs (530 replies) and the work on the recommender tool
February 18 2015 NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Network Effects: RMap Project
Sheila M. Morrissey, Senior Researcher, ITHAKA
FAIR Data Management and FAIR Data Sharing – Merce Crosas
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Slides from Friday 3rd August - Data in the Scholarly Communications Life Cycle Course which is part of the FORCE11 Scholarly Communications Institute.
Presenter - Natasha Simons
Feb 26 NISO Training Thursday
Crafting a Scientific Data Management Plan
About the Training
Addressing a data management plan for the first time can be an intimidating exercise. Join NISO for a hands-on workshop that will guide you through the elements of creating a data management plan, including gathering necessary information, identifying needed resources, and navigating potential pitfalls. Participants explore the important components of a data management plan and critique excerpts of sample plans provided by the instructors.
This session is meant to be a guided, step-by-step session that will follow the February 18 NISO Virtual Conference, Scientific Data Management: Caring for Your Institution and its Intellectual Wealth.
About the Instructors
Kiyomi D. Deards, MSLIS, Assistant Professor, University of Nebraska-Lincoln Libraries
Jennifer Thoegersen, Data Curation Librarian, University of Nebraska-Lincoln Libraries
Making Data FAIR (Findable, Accessible, Interoperable, Reusable) – Tom Plasterer
What to do About FAIR…
In the experience of most pharma professionals, FAIR remains fairly abstract, bordering on the inconclusive. This session will outline specific case studies – real problems with real data – and address real opportunities and concerns.
Why making data Findable, Accessible, Interoperable and Reusable is important.
Talk presented at the Data Driven Drug Development (D4) conference on March 20th, 2019.
FAIR Data Knowledge Graphs – from Theory to Practice – Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
An overview on FAIR Data and FAIR Data stewardship, and the roadmap for FAIR Data solutions coordinated by the Dutch Techcentre for Life Sciences. This presentation was given at the Netherlands eScience Center's "Essential skills in data-intensive research" course week.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
BioPharma and FAIR Data, a Collaborative Advantage – Tom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School – Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the "assets" of data, models, codes, SOPs and workflows. The "FAIR" (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying cry. Funding agencies expect data (and increasingly software) management, retention and access plans. Journals are raising their expectations of the availability of data and codes, both pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
FAIR for the future: embracing all things data – ARDC
FAIR for the future: embracing all things data - Natasha Simons, Keith Russell and Liz Stokes, presented at Taylor & Francis Scholarly Summits in Sydney 11 Feb 2019 and Melbourne 14 Feb 2019.
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... – Sarah Anna Stewart
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
Presented by Michael Victor, Abenet Yabowork, Jane Poole, Harrison Njamba, Erick Rutto and Peter Ballantyne at the ILRI open access week workshop, ILRI, Nairobi, 23-25 October 2019
NISO Two Day Virtual Conference:
Using the Web as an E-Content Distribution Platform:
Challenges and Opportunities
Oct 21-22, 2014
Maryann Martone, Ph.D., Professor of Neuroscience, University of California, San Diego
Brown Bag Talk with Micah Altman: Integrating Open Data into Open Access Journals – Micah Altman
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk discusses findings from an analysis of data sharing and citation policies in Open Access journals and describes a set of novel tools for open data publication in open access journal workflows. Bring your lunch and enjoy a discussion fit for scholars, Open Access fans, and students alike.
Dr Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... – dkNET
The NIDDK Information Network (dkNET; http://dknet.org) is an open community resource for basic and clinical investigators in metabolic, digestive and kidney disease. dkNET's portal facilitates access to a collection of diverse research resources (i.e. the multitude of data, software tools, materials, services, projects and organizations available to researchers in the public domain) that advance the mission of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This webinar was presented by dkNET principal investigator Dr. Jeffrey Grethe.
Ross Wilkinson - Data Publication: Australian and Global Policy Developments – Wiley
Australia invests AUD $1-2B per annum in research data. Like most countries, it wants to get the best possible return on this data. Europe is spending €1.4B on its open data "pilot". This means the data should be FAIR: findable, accessible, interoperable, and reusable. Part of this is that data should be routinely "published" and available in a "data repository". But what does this mean?
Ross Wilkinson
CEO, Australian National Data Service
Presented at the 2015 Wiley Publishing Seminar, 5 November, Melbourne, Australia.
Facilitating good research data management practice as part of scholarly publ... – Varsha Khodiyar
Presentation given to the SciDataCon #IDW2018 session: Democratising Data Publishing: A Global Perspective, on Tuesday 6th November 2018, Gaborone, Botswana
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Data Repositories: Recommendation, Certification and Models for Cost Recovery – Anita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
The Importance of Martian Atmosphere Sample Return – Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Richard's entangled adventures in wonderland – Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Cancer cell metabolism: special reference to the lactate pathway – AADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose, and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos). It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELLS:
Unlike healthy cells that "burn" the entire sugar molecule to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis, and frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per glucose molecule instead of the roughly 36 ATP healthy cells gain. As a result, cancer cells need to use far more sugar molecules to get enough energy to survive.
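The ATP arithmetic above can be checked with a short illustrative calculation. The yields per glucose are the rounded textbook figures used in these slides (real yields vary by cell type and conditions), and the ATP demand chosen for the comparison is arbitrary:

```python
# Approximate ATP yield per glucose molecule (rounded textbook figures
# from the slides above; actual yields vary with cell type and conditions).
ATP_GLYCOLYSIS_ONLY = 2    # glycolysis alone (cancer-cell-like metabolism)
ATP_FULL_RESPIRATION = 36  # glycolysis + Krebs cycle + oxidative phosphorylation

def glucose_needed(atp_demand: int, atp_per_glucose: int) -> float:
    """How many glucose molecules are needed to meet a given ATP demand."""
    return atp_demand / atp_per_glucose

demand = 360  # arbitrary ATP demand, chosen only for the comparison
normal = glucose_needed(demand, ATP_FULL_RESPIRATION)  # 10 glucose
cancer = glucose_needed(demand, ATP_GLYCOLYSIS_ONLY)   # 180 glucose

# A glycolysis-only cell needs ~18x more glucose for the same ATP.
print(normal, cancer, cancer / normal)
```

This 18-fold difference in glucose demand is the quantitative core of the "wasteful" metabolism described above.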
Introduction to the WARBURG EFFECT:
Cancer cells are usually highly glycolytic ("glucose addiction") and take up more glucose from their surroundings than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the 1931 Nobel Prize in Physiology or Medicine for his "discovery of the nature and mode of action of the respiratory enzyme".
WARBURG EFFECT: the tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg observed that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing by which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that were causing the silencing through RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; inhibits translation.
siRNA: 21 nt long; cis-acting; binds its target mRNA with a perfectly complementary sequence.
piRNA: 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
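The three steps above can be sketched as a toy simulation. This is purely illustrative: the RNA sequences are invented, the cut size is fixed at 21 nt, and the real biochemistry of Dicer processing and RISC loading is far more involved:

```python
# Toy model of the RNAi pathway described above:
# Dicer cuts long dsRNA into ~21-nt siRNAs, RISC retains one (guide)
# strand, and any mRNA containing the complementary site is degraded.

COMPLEMENT = str.maketrans("AUGC", "UACG")

def dicer(long_rna: str, size: int = 21) -> list[str]:
    """Cut a long RNA strand into siRNA-sized pieces."""
    return [long_rna[i:i + size]
            for i in range(0, len(long_rna) - size + 1, size)]

def risc_silence(guide: str, mrnas: list[str]) -> list[str]:
    """Return the mRNAs that survive: those lacking the guide's target site."""
    target_site = guide.translate(COMPLEMENT)[::-1]  # reverse complement
    return [m for m in mrnas if target_site not in m]

# Invented 43-nt dsRNA sense strand, cut by "Dicer" into 21-nt pieces.
sirnas = dicer("AUGGCUACGAUCGUAGCUAAUGCCAUUGGCAUCGAUCGAUCGA")
guide = sirnas[0]  # RISC discards the other strand, keeping the guide
target_site = guide.translate(COMPLEMENT)[::-1]
mrnas = ["GGG" + target_site + "CCC",  # contains the target site: silenced
         "AUGAAACCCGGGUUU"]            # unrelated mRNA: survives
surviving = risc_silence(guide, mrnas)
print(surviving)  # only the unrelated mRNA remains
```

The sequence-specificity of the pathway falls out of the string containment test: only mRNAs carrying the guide's reverse complement are destroyed.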
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers mRNA degradation.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they play a key role in regulating gene expression.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
1. Why would a publisher care about open data?
Anita de Waard
November 2019
2. Why would a publisher care about open data?
What do we mean by open?
What do we mean by data?
What do we mean by a publisher?
3. data
"Data, after all, is stuff machines can handle […] we could create a world in which it would be programs -- not just people -- that would enjoy the data. For data, as for documents, the value of any part of the web is increased by the amount of other stuff out there. For documents it is the ability to follow links, but for open data it is the ability to also interconnect and join, to summarise and compare, to monitor, extrapolate, to infer."
-- Tim Berners-Lee, 2009
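The "interconnect and join" idea in the quote can be illustrated with a minimal sketch: two hypothetical open datasets joined on a shared identifier, so that a program, not a person, combines and summarises them. The records, fields, and DOIs below are invented for illustration:

```python
# Minimal illustration of machine-joinable open data: two invented record
# sets share a DOI field, which acts as the join key a program can follow.
datasets = [
    {"doi": "10.5061/xyz.1", "title": "Reef survey 2018", "n_samples": 120},
    {"doi": "10.5061/xyz.2", "title": "Reef survey 2019", "n_samples": 95},
]
citations = [
    {"doi": "10.5061/xyz.1", "cited_by": 14},
]

def join_on_doi(left: list[dict], right: list[dict]) -> list[dict]:
    """Left-join two lists of records on their 'doi' field."""
    index = {r["doi"]: r for r in right}
    return [{**l, **index.get(l["doi"], {})} for l in left]

joined = join_on_doi(datasets, citations)
# Once joined, programs can summarise and compare across sources:
total_cited = sum(r.get("cited_by", 0) for r in joined)
print(total_cited)  # 14
```

In real linked open data the join keys would be persistent identifiers (DOIs, RRIDs, URIs), which is exactly why the provenance and citation practices listed on the next slide matter.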
NOW!
• Provenance of data: STAR Methods at Cell
• Contributor Roles (CRediT) taxonomy
• Citation and linking to data and software
• Versioned linking to data & software
Example STAR Methods key resources table (REAGENT/RESOURCE | SOURCE | IDENTIFIER):
Antibodies
- Rabbit monoclonal anti-Snail | Cell Signaling Technology | Cat#3879S; RRID: AB_2255011
- Mouse monoclonal anti-Tubulin (clone DM1A) | Sigma-Aldrich | Cat#T9026; RRID: AB_477593
- Rabbit polyclonal anti-BMAL1 | This paper | N/A
Bacterial and Virus Strains
- pAAV-hSyn-DIO-hM3D(Gq)-mCherry | Krashes et al., 2011 | Addgene AAV5; 44361-AAV5
- AAV5-EF1a-DIO-hChR2(H134R)-EYFP | Hope Center Viral Vectors Core | N/A
- Cowpox virus Brighton Red | BEI Resources | NR-88
- Zika-SMGC-1, GENBANK: KX266255 | Isolated from patient (Wa 2016) | N/A
- Staphylococcus aureus | ATCC | ATCC 29213
- Streptococcus pyogenes: M1 serotype strain SF370; M1 GAS | ATCC | ATCC 700294
Biological Samples
- Healthy adult BA9 brain tissue | University of Maryland Brain & Tissue Bank | Cat#UMB1455
4. 19.11.2019
Elsevier Data Solutions for Research
open
Scholix: a Linked Open Data Hub to connect papers and datasets
Research Object Composer: an open-source editor for Research Objects
5. a publisher
What does a publisher even do anymore?
Example 1: Human papilloma virus causes cervical cancer
[Figure: citation network ("cites" links) connecting new (2008) and existing (1977) literature]
6. What does a publisher even do anymore?
Example 2: Top 20 universities in Quantum Computing
7.
[Figure: Author, Editor/Publisher, Reader/User and Researcher connected via Data → Results → Article → UI, shown as a network of articles, tools, data and users]
Model: Castle
• Goal: selling content
• Metrics: number of units sold
• Strategy: optimize content delivery to users
Model: Marketplace
• Goal: grow number of interactions
• Metrics: number of interactions between users
• Strategy: optimize number of network interactions
Why publishers care about open science:
Today: linear supply chains. The future: networked open science.
Linear supply chains are evolving into complex, dynamic and connected value webs.
Win by reputation → win by trust.
8. 19.11.2019
Elsevier Data Solutions for Research
Extra Slides:
1. Elsevier in numbers
2. Research Data Management
3. Research Object Composer
4. Entellect and Life Science Solutions
5. Data analytics: Quantum Computing
6. Elsevier and Open Science
10. Elsevier by the numbers
25,000: our products are used at more than 25,000 academic and government institutes globally.
14+ m: people a month use ScienceDirect, our flagship platform for academic research.
320+: Reaxys®'s ML capability enables the chemistry of drug discovery and materials innovation for over 320 pharma innovators, 130 chemical companies, and over 1,100 …
7,500: Elsevier has 7,500 employees and serves customers in over 180 countries.
430,000: Elsevier publishes 430,000 peer-reviewed articles annually.
9 m: Mendeley is a scientific social media platform that enables around 9 million users worldwide to organize, write, collaborate and promote their …
12. Elsevier Data Solutions for Research
Create & Collect · Store · Control · Collaborate · Analyze · Disseminate
Collect
Create
Extract
Store
Secure
Manage
Control
Workspaces
Researchers
Data sets
Search
Integrate
Analyze
Share
Publish
Archive
EntellectTM
MACRO EDC
Hivebench GDPR
13.
How we deliver
1. Open system: through open APIs, modules can be integrated with other RDM tools
2. Data remains private and is owned by the institution
3. The system is integrated with researcher workflows, to ensure simple and clear use
4. Researchers continue to work the same way, avoiding additional bureaucracy and administration
14.
Data Search
Retrieve active data, discover public data
Discover data
• 10 million+ datasets indexed from more than
35 repositories
• Deep indexing of data significantly enhances
the relevancy of results
• Keyword search within data files
• Filter search results by specific author,
institution, journal, subject category
Retrieve active data*
• Navigate to locally held institutional data
• Powerful keyword search and filtering
15.
Data Manager
Researchers can
• Share data privately within a research project
• Invite external collaborators to join a project
• Gather research data from data sources as it’s
generated (including ELNs)
• Annotate research data with detailed, subject-
specific metadata
• Curate data according to project or institutional
workflows
• Prepare to publish data on a repository of your
choice
• Open APIs allow tailored upload forms, automated workflows, and analysis and re-upload of data files
Go from raw files to active datasets
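To make the "annotate with subject-specific metadata" and open-API bullets above concrete, here is a minimal sketch of preparing a dataset record for upload. The field names and the required-field set are illustrative assumptions for this sketch, not the actual Mendeley Data API schema.

```python
import json

# Hypothetical metadata template; field names are illustrative only,
# not taken from the actual Mendeley Data API.
REQUIRED_FIELDS = {"title", "description", "authors", "license"}

def prepare_dataset_record(metadata: dict) -> str:
    """Validate subject-specific metadata and serialize it for upload."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return json.dumps(metadata, sort_keys=True)

record = prepare_dataset_record({
    "title": "HPV citation corpus",
    "description": "Edge list of the HPV/cervical-cancer citation network",
    "authors": ["A. Researcher"],
    "license": "CC-BY-4.0",
    "subject_metadata": {"field": "oncology", "organism": "human"},
})
print(record[:60])
```

A curation workflow would run a validator like this before the record ever reaches the repository, which is how "tailored upload forms" can enforce subject-specific templates.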
16.
Data Repository
Researchers can
• Store up to 100 GB of data per
dataset in many formats
• Describe how experiments can be
reproduced
• Keep track of dataset versions
• Create a DOI for citation (optionally under a university prefix)
Store datasets in a secure and trusted repository
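Minting a DOI for a dataset means registering metadata with DataCite. The sketch below shows a simplified DataCite-style record; the suffix of the DOI is illustrative, and real registrations go through the DataCite REST API with repository credentials (the 10.17632 prefix is the one Mendeley Data uses).

```python
import json

# Simplified DataCite-style record; real registrations require a
# repository prefix and credentials. The DOI suffix here is made up.
doi_metadata = {
    "doi": "10.17632/example.1",
    "creators": [{"name": "Researcher, A."}],
    "titles": [{"title": "HPV citation corpus"}],
    "publisher": "Mendeley Data",
    "publicationYear": 2019,
    "types": {"resourceTypeGeneral": "Dataset"},
    "version": "2",  # dataset versioning, as in the slide above
}
print(json.dumps(doi_metadata, indent=2))
```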
17.
Data Monitor
Institutions can
• Keep track of data inside
and outside institution
• Achieve credibility,
visibility and integrity of
key research outputs
• Maintain visibility of
events in RDM space
• Improve researchers' adoption of data sharing tools
• Communicate value of
data sharing to
researchers during the
research process
Encourage and monitor compliance
18. Five Facts about Elsevier and Research Data
Fact #1: Elsevier's Mendeley Data supports the entire lifecycle of research data
The 5 modules that make up Mendeley Data are specifically designed to utilize data to its fullest potential, simplifying and enhancing current ways of working.
Fact #2: Researchers and institutions own and control all the data
Mendeley Data allows researchers to keep data private, or publish it under one of 16 open data licenses, so they stay in full control.
Fact #3: Mendeley Data is an open system
It is a flexible platform: modules are designed to be used together, standalone, or combined with other Elsevier and non-Elsevier solutions.
Fact #4: Mendeley Data can increase the exposure and impact of research
Mendeley Data Search indexes over 10 million datasets from more than 35 repositories.
Fact #5: Elsevier is an active participant in the open data community
Elsevier partners with the open data community, and is currently working on more than 20 projects globally.
19.
Mendeley Data already integrates through open APIs with the global Research Data
Management ecosystem, as well as other Elsevier solutions
+ 35 repositories
(BePress planned)
• Mendeley Data Repository
datasets are automatically
synced with the Pure
curation workflow
• Projects, grants,
equipment, showcase
on portal (planned)
• Mendeley Data Search results
are visible on Scopus
• Notify new articles to Monitor
for data sharing compliance
• Datasets appear as records
on Scopus (planned)
• Mendeley Data usage is
accessible through Plum API
and widget
• Plumx metrics (citations,
usage, social mentions) are
captured and shown on
Mendeley Data Repository
Publish datasets
alongside an article
on Mendeley Data
within the SSRN
publication flow
Publish or link datasets
alongside an article on
Mendeley Data within the
ScienceDirect publication flow
Researcher and
Institutional
Dataset metrics
• User identity & login
• Library (planned)
• Notes (planned)
• Projects (planned)
Existing integration
Planned integration
• Mendeley Data indexed by the OpenAIRE index
• OpenAIRE Zenodo repository indexed by Mendeley Data Search
Long-term
preservation of
published datasets
Links between articles and datasets:
• Contributed by Mendeley
Data to Scholix
• Indexed by Mendeley Data Search and Data Monitor
• Consumed by Scopus and
ScienceDirect
Integrate with machine-readable data management plans
• For more than 35 repositories the
metadata as well as the underlying
datasets are indexed by Mendeley
Data Search
• First repositories are actively
integrating with the free and open
‘push API’ of Mendeley Data
Search
• Mint DOIs for Mendeley Data
Repository
• DataCite indexed by
Mendeley Data Search
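The Scholix article-dataset links mentioned above are queryable through the ScholeXplorer service. As a rough sketch, a client could build a link query for an article DOI like this; the endpoint and parameter name follow the ScholeXplorer v2 API as I understand it, and the DOI is illustrative, so verify both against the current documentation.

```python
from urllib.parse import urlencode

# ScholeXplorer v2 link endpoint (assumed from the v2 API; verify
# against current docs before use).
BASE = "http://api.scholexplorer.openaire.eu/v2/Links"

def scholix_query(article_doi: str) -> str:
    """Return a URL asking for all links whose source is this DOI."""
    return BASE + "?" + urlencode({"sourcePid": article_doi})

# Illustrative DOI, not a real article.
url = scholix_query("10.1016/j.example.2019.01.001")
print(url)
```

Fetching that URL would return Scholix link records (article → dataset and dataset → article), which is exactly what Scopus and ScienceDirect consume per the bullet above.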
21. Building an open interoperable data ecosystem:
• Identification: locate things regardless of where they are
• Aggregates: link things together
• Annotations: about things and their relationships
• Container: packaging content and links (zip files, BagIt, Docker images)
22. Building an open interoperable data ecosystem:
[Diagram: a workflow tool (Input → Task 1 → Task 2 → Task 3 → Output) draws on a database and an open repository; its outputs flow through the Research Object Composer, via an Open API, into Mendeley Data.]
Research Object Composer: http://www.researchobject.org
• Research Object Profiler: adds annotations and relationships (metadata) to a collection to describe a research object: URI, length, filename, checksums, etc.
• Research Object Serializer: serialises the Research Object in a standard format based on BagIt (a manifest itemizing file names).
Mendeley Data provides:
• DOIs and metadata (Findability)
• An open repository (Accessibility)
• Versioning and the RO standard (Interoperability, Reusability)
23. Purpose of the Research Object Composer*:
• The RO Composer is not a registry of research objects, but it can list research objects currently under construction.
• The RO Composer is a microservice whose responsibility is to help other services create and deposit research objects.
• The Composer acts as a temporary construction site that can be completed by multiple services (e.g. a data management system, a workflow system, a user interface).
• These clients jointly build a Research Object that can then be validated against the schema, before the RO is downloaded or deposited into an archive (like Zenodo or Mendeley Data).
• Clients of the RO Composer are applications (driven by a user interface) or agents (engaged automatically by other events, e.g. a workflow run).
• The RO Composer is not a required component: any software may generate research objects by following the Research Object specifications.
* From: https://github.com/ResearchObject/research-object-composer/blob/master/introduction.ipynb
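The Profiler/Serializer idea from slide 22 can be sketched in a few lines: collect per-file metadata (filename, length, checksum) and write a BagIt-style manifest itemizing the aggregated files. This is an illustration of the pattern, not the RO Composer's actual code; the manifest layout here is a simplified stand-in for the RO/BagIt serialization.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def profile_file(path: Path) -> dict:
    """Collect the metadata the RO Profiler records for each resource."""
    data = path.read_bytes()
    return {
        "uri": path.as_uri(),
        "filename": path.name,
        "length": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def serialize_research_object(paths, manifest_path: Path) -> dict:
    """Write a BagIt-style manifest itemizing the aggregated files."""
    manifest = {"aggregates": [profile_file(p) for p in paths]}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest

# Demo with two temporary files standing in for research outputs.
tmp = Path(tempfile.mkdtemp())
(tmp / "data.csv").write_text("id,value\n1,42\n")
(tmp / "notes.txt").write_text("experiment notes")
ro = serialize_research_object(sorted(tmp.glob("*")), tmp / "manifest.json")
print(len(ro["aggregates"]))
```

Because the checksums travel with the manifest, an archive such as Zenodo or Mendeley Data can verify the deposit file-by-file on ingest.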
27. Human Papilloma Virus and Cervical Cancer
1946: Papanicolaou develops the Pap smear
1976: zur Hausen proposes a link between HPV and cervical cancer
2006: Gardasil HPV vaccine approved
2008: zur Hausen awarded the Nobel Prize
This talk studies the impact of the intervening research.
28. Early Work (1977): “a hypothesis has been presented that the virus found in genital warts may be involved in the etiology of human genital cancer”
30. Citation Mapping Process
• Build a corpus of papers using a broad search (~20,000 papers) on all aspects of cervical cancer and HPV
• Expand the corpus by adding all cited works not in the original corpus
• Add cited works from the cited corpus (“grandchild” references)
• Connect the discrete steps of scientific advances connecting the works
• Apply graph mathematics to find all connected paths
31. Assembling the Graph
• Dense interconnected web of citations
• Filter for only cited works within 3 years of the citing work, so each edge reflects building on relevant knowledge
[Diagram: first-level and second-level references; identities are recognized in the graph and matched to the corpus.]
32. Building the Corpus
'papillomaviridae' AND 'cancer' AND [article]/lim: 2,747 results from 1975-2019
• 55,414 references total cited in this set
• 29,064 unique references (the references overlap), spanning 1870-2019
• 719,470 references cited in this set of 29,064 papers
• 259,908 unique references in this set
The total corpus of work using this method is 182,402 unique articles
• The citation network has 103,443 edges
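The corpus-building and 3-year recency filter described on the slides above can be sketched in a few lines of plain Python (toy data; the real corpus has 182,402 articles and 103,443 edges):

```python
# Toy citation records: (citing_paper, citing_year, cited_paper, cited_year).
# The real corpus comes from the broad literature search described above;
# this sketch only illustrates the 3-year recency filter.
citations = [
    ("p3", 1980, "p1", 1977),  # cited within 3 years of the citer: kept
    ("p3", 1980, "p0", 1950),  # 30 years older than the citer: dropped
    ("p5", 1984, "p3", 1980),  # 4 years older: dropped
    ("p5", 1984, "p2", 1983),  # 1 year older: kept
]

def build_graph(citations, window=3):
    """Keep only edges where the cited work is within `window` years."""
    graph = {}
    for citer, citer_year, cited, cited_year in citations:
        if 0 <= citer_year - cited_year <= window:
            graph.setdefault(citer, set()).add(cited)
    return graph

graph = build_graph(citations)
for citer in sorted(graph):
    print(citer, sorted(graph[citer]))
```

The filter is what turns a dense web of citations into edges that plausibly represent one advance building on a recent one.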
33. Path Finding
Select “interesting” endpoints:
• Significant starting point: the proposal that HPV could be related to cancer
• Significant endpoint: recognition of the HPV/cancer connection
Use graph traversal analytics to find all paths of more than 5 papers that connect the two ideas, then separate them by year.
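Enumerating all simple paths of more than 5 papers between the two endpoints is a depth-first search. A pure-Python sketch over a toy graph (node names are illustrative placeholders for papers):

```python
def all_paths(graph, start, end, min_len=5):
    """Enumerate simple paths from start to end with more than min_len nodes."""
    results, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            if len(path) > min_len:
                results.append(path)
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:  # simple paths only: no repeated papers
                stack.append((nxt, path + [nxt]))
    return results

# Toy citation chain from the 1977 hypothesis to the 2008 recognition.
graph = {
    "hpv1977": ["a", "b"],
    "a": ["b"], "b": ["c"], "c": ["d"], "d": ["nobel2008"],
}
paths = all_paths(graph, "hpv1977", "nobel2008")
for p in paths:
    print(" -> ".join(p))
```

Here the 5-node shortcut through "b" is discarded and only the 6-node path survives, matching the "greater than 5 papers" threshold on the slide.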
44. Top 20 universities active in Quantum Computing
[Scatter plot: FWCI (y-axis, 0 to 5) against number of publications (x-axis, 0 to 250). Institutions shown: University of Waterloo, National University of Singapore, MIT, University of Science and Technology of China, University of Oxford, Tsinghua University, University of Tokyo, Harvard University, University of Maryland, University of New South Wales, UC Santa Barbara, ETH Zurich, University of Sydney, RAS, University of Southern California, Perimeter Institute for Theoretical Physics, University College London, Princeton University, University of Michigan.]
55. ELSEVIER | Elsevier Open Science: Creating value through collaboration | CONFIDENTIAL
Global market dynamics and technologies are reconfiguring the academic ecosystem:
Macroeconomic developments
Ecological and societal sustainability
• Global population is growing; 9B people in 2050
• Challenge to produce more with less and cleaner
input
• Challenge to solve poverty and unequal
allocation of resources
Shifting power balance from West to East
• Strong economic growth in China and India
• Rise of the middle class; improvement of
educational and health care system and food
supply chain
Technological developments
The web
• Everyone is a publisher
• Content access is ubiquitous
The social web
• Professional and personal networks emerge
without traditional institutions
• Everyone is a peer reviewer
Big data
• Explosion of data through networking of
measurement tools
• Radically cheaper tools and computing
power
Social developments
• Pressure from society and funders to justify the costs of science
• Need for reliable research results (that can be trusted)
• Patients/citizens demand access and increased participation
• Distributed computing makes it easier to make and share tools, content and code
• Overall need for more transparency and accountability, also in doing and reporting research
Emergence of open science: Open Peer Review, new social networks, Open Data (data, tools and workflows are shared), open APIs, Open Source Software, and greater engagement from society.
56.
Carl Kesselman builds tools to enable
neuroscientists to store and share their data
in a better way
Viktor Pankratius builds software programs
that generate hypotheses about volcano
eruptions: the software can steer drones to
collect data.
Lena Deus solves scientific problems through Kaggle: the system awards her points for scoring highest on machine learning tasks.
Captions: Scientists build data-sharing tools · Computers are scientists · Science becomes a game, which anyone can play
Some examples of Open Science:
57.
Moving to a network of connected components:
1. Take an open-source data repository and find some open data: Deriva, an open-source data repository, holding neuroscience data.
2. Write some open-source software to mash them up: a Jupyter Notebook to calculate properties.
3. Share outputs as OA/OD/OS: share new datasets on Deriva, publish papers in an OA journal, share code on platforms like GitHub.
Networked system:
1. The community adds elements to open science platforms that can be used by everyone.
2. Researchers build upon the combination of shared content and system elements. This leads to new scientific knowledge and output.
3. All sharable elements find their way to other open platforms and formats and can be re-used, causing a network effect.
[Diagram: users A, B and C work across Platform A (Data v1), Platform B (Tools B), and an Open Research Platform (Data v2, Tools C, article), connected to open data repositories, open access journals, and code networks.]
58.
[Diagram, left: a linear supply chain of Suppliers → Manufacturers → Distributors → Consumers. Right: a dense network of articles, tools, data, and users linked to one another.]
Open Science represents a transition from a pipeline to a networked knowledge system:
Today: linear supply chains. Model: Castle
• Goal: selling content
• Metrics: number of units sold
• Strategy: optimize content delivery to users
• Win by reputation
The future: networked open science. Model: Marketplace
• Goal: grow the number of interactions
• Metrics: number of interactions between users
• Strategy: optimize the number of network interactions
• Win by trust
Linear supply chains are evolving into complex, dynamic and connected value webs.
59.
Some current Open Science efforts:
[Diagram: Open Science spans Open Access, Open Data, Open Metrics, Research Integrity & Reproducibility, Science & Society, and Open Tools and Software.]
Open Access:
- Hybrid/Gold journals, open/self-archive options
- Contributing to CHORUS, CrossMark, RA21
- ‘Platinum OA’ on bepress Digital Commons
- Piloting SSRN preprints of The Lancet
Research Integrity and Reproducibility:
Many efforts, including:
- Full GDPR Compliance across all Elsevier products
- Preregistration and Registered Reports
- STAR Methods for Cell, transparent reporting
- Plagiarism and Image manipulation detection
- Statistics checking
- Reproducibility badges/TOP guidelines
- Transparency in contributorship roles (CRediT
Taxonomy)
- Research collaborations, e.g. Humboldt, Data Integrity
Science and Society:
- Science Literacy effort: Topic Pages,
Audioslides, Science and People
- Access to content via Patient Inform,
Research4life, Bookshare and Load2Learn.
- The Elsevier Foundation supports many projects including Green and Sustainable Chemistry, awards for early-career women scientists from the developing world, and many more
Open Data:
- All data is open on all platforms
- Following TOP guidelines across the board
- Co-leads Enabling FAIR Data, requiring data deposits in Earth & Space Science
- Co-leads the Data Citation Principles in FORCE11
- Supporting the Scholix linked-data repository and other open data standards efforts through RDA, ORCID, CrossRef, etc.
Open Metrics:
- CiteScore free API
- PlumX metrics and NewsFlo: free layer of
societal impact metrics on article level
- Helping lead the RDA Make Data Count effort with CDL/DataCite to establish data metrics
Open Tools and Software:
- Open APIs for most products
- Many research collaborations leading to Open Source
software, e.g. Github4Labs, NIH Data commons
- Hackathons, in medicine (Elsevier Hacks) and for Mendeley
- Content and data available for research and development
and hackathons
Editor's Notes
Analogies:
Manager is like OneDrive for datasets: collaborate on active projects; allows for review and approval of datasets prior to publication by the library
Manager is the Trello for research project management
RESEARCHER: Example from Wouter: Why would a psychologist use this?
Project management dashboard : It enables organized project management (where is the data? Could be dropbox)
Templates can be set up
MOVE FROM FILES TO DATASET (files with description, metadata and structure)
Manager helps make your data FAIR
INSTITUTION: Monitor allows for clear presentation and enables librarians to decide whether to keep or delete private data, especially when someone has left the institution. Archival policies. Monitor helps prevent “data loss”
Now let’s dive a little deeper into each module, starting with Repository. We know that counting only publications does not reflect the true amount of research created during an experiment; there is likely more than 1 dataset tied to a published article. By using Repository, researchers can:
Store up to 100GB of data per dataset
Ensure proper metadata tagging and storage
Increase discoverability of their dataset by easily creating a DOI to allow for citation. This also ensures datasets gets counted as a research output.
Standards-based metadata framework for logically and physically bundling resources with context http://researchobject.org
So let’s get to quantum computing, which is the area we were asked to focus on within the larger topic of quantum technologies. Here we can see the institutions that create the largest number of papers on QC, with the Chinese Academy of Sciences and CNRS, two national lab systems, at or near the top.
If we flip this to look at field-weighted citation impact, however, a measure of the works relative impact in the field, we get a very different picture—still highly international, but more US institutions here, and notably a number of US companies producing high-impact work.
The word cloud represents the top 50 semantically-derived keyphrases for the total set of papers representing quantum computing.
If we click on the specific term “polynomial approximation” in the word cloud, we find out how quickly the topic has grown over the last 5 years, and even which individuals and institutions worldwide are working on the particular concept of polynomial approximation. It’s immediately evident that quantum computing is a highly international and competitive field. And remember, 50 keyphrases exist for each of the 100,000 topics that are modeled in the topic prominence calculation.
Let’s slice the data in a different way. Here are top 20 institutions outside the US, again arranged by FWCI, who are doing important work in quantum computing. Notice anything? Virtually every one of these is a university.
Here’s the same list for the US. What is different here? For the US list alone, there are 3 large corporations, the NSF, and a DOE national lab contributing high-impact research. We know that quantum computing is being invested in and chased vigorously across the globe. The Chinese are pouring immense financial resources into this, and they have plenty of human talent, including many who are likely employed by the people in this room.
In my view, it is this nexus of different organizations, the close linkages between them, that gives the US its edge, if we have any edge. SEMATECH is another example of a complex of different organizations engaging in coordinated action. Over 90% of the research papers that Google publishes, and over 80% that IBM publishes, are done with one or more collaborators from academia.
So what does this difference look like in action? This geomap captures global research activity in quantum computing. The size of the bubble is the number of papers, the color intensity is the FWCI. Here we can see research is fairly evenly distributed in the US, Europe, and East Asia.
The Y axis here is the Field-Weighted Citation Impact for each university, while the position on the X-axis shows the total number of papers; clearly UC Santa Barbara is doing something exceptional here, which we’ll explore a bit more later. Waterloo and NUS are producing a lot of papers, though at a relatively low citation impact. Generally, the more papers one is publishing, the lower the overall impact will be. (Taken from a slightly different data set.)
We can look at other proxies for quality, including the number of outputs in top percentiles—here the percentage of research in the top 10% of cited outputs, which is around 29% for the US in 2016, around 15.5% for non-US institutions.
Here’s the same map, but now the color intensity is the level of academic-corporate collaboration. The dark red are tech companies, but US universities also have much higher levels of AC collaboration than others. Europe and Asia are very pale by comparison.
Let’s look at a different and more granular view of the same information. There is a lot going on in this graphic; it’s a different way of looking at the landscape. The bluer the dot, the higher the FWCI. The thicker the line, the more papers are shared between the two nodes. Network centrality implies higher levels of connectedness. Japan is peripheral and mostly connected to other Japanese entities. China, particularly Tsinghua and UST China, are more connected, Singapore still more so. However, they are not as connected or central as a few key US, UK, Australian and Canadian institutions, and one can clearly see that a few large US corporations are also quite central here.
In my view, one remaining advantage the US seems to have (in addition to lots of high-quality research) is the nexus between industry and academia--because of the enormous manufacturing complexities, the SEMATECH kind of highly coordinated approach (academia/industry/govt) may make more sense in this sector than many--also given questions of cryptographic security and national security implications.
We can also look at three-factor analysis. Here we map total scholarly output on the Y axis. US output of 2,392 papers (2008-2016) represents about 27% of global output. The X axis is the level of academic-corporate collaboration. 7.7% of US papers, but only 1.2% of non-US papers, are AC collaborations. Finally, the size of the bubble shows the number of patent citations for every thousand papers published. For the US, this is 111 citations, meaning over 11% of these papers were cited in patents worldwide. It generally takes 3-5 years before papers are cited in patents, so this likely understates the total since we have 2016 papers in here. The same measure for non-US institutions is 21.6 per 1,000, less than one-fifth the level.
This graphic really points out the large gap between the US and the ROW regarding UI collaboration, and overall patenting activity driven by university research as well.
The quantum computing topic is actually an aggregate made up of somewhat more and distinct granular topics—the same kinds of analysis can be done on these topics, which are generated directly from the topical model that I covered earlier.
This is the same topics by country and number of downloaded articles.
We can look at top corporations publishing in this area, and can see that the bulk are US firms with some Japanese representation as well.
Top universities for the same topic—Yale, UCSB, Berkeley, and MIT produce a great deal, with UCSB and Yale authors having a particularly high FWCI
One can always do a Keyphrase-based analysis if you want to delve into a particular aspect of the topic. Here we look at the same set of papers on flux qubits that cover the concepts of circuits, resonators, and Josephson junctions—note the number of papers from Yale has gone down from 85 to 46 here. Dr. Devoret has produced more work than anyone else covering these concepts.