Scott Edmunds: A new publishing workflow for rapid dissemination of genomes u...GigaScience, BGI Hong Kong
Scott Edmunds on a new publishing workflow for rapid dissemination of genomes using GigaByte & GigaDB. Presented at Biodiversity 2020 in the Annotation & Databases track, 9th October 2020.
The original abstract for the talk is below, but the talk changed in response to strong interest in InChI and the possibilities of using it in a Semantic Web for Chemistry.
The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. However, freedom has its costs, and in many cases the cost is quality. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of how a curated platform can become the centralized hub for resourcing information about chemical entities. We will also present ChemMantis, an entity-extraction platform for extracting chemical names and scientific terms from documents and providing a platform for structure-based searching of Open Access chemistry literature.
Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing p...GigaScience, BGI Hong Kong
Jesse Xiao at the Data Publishing session at CODATA2017: Updates to the GigaDB open access data publishing platform. Wednesday 11th October in St Petersburg, Russia
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk given by Jonathan Tedds of the University of Leicester for the Data Management in Practice workshop, which took place on 14th November 2013 at the London School of Hygiene and Tropical Medicine.
There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in the tools to manipulate these data. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 200 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the ChemSpider platform and how it is fast becoming the centralized hub for resourcing information about chemical entities.
Open Science is a movement to make scientific research, its data and dissemination accessible to all levels of society. This movement considers aspects such as Open Access, Open Data, Reproducible Research and Open Software.
Each of these aspects presents specific issues that need to be evaluated and discussed by the scientific community, so that guidelines can be established that facilitate the dissemination of scientific information.
The great challenge is to establish effective and efficient practices that allow journals to incorporate these demands into their editorial processes, so as not only to make data, software and methods accessible, but also to encourage the community to do so.
With these questions in mind, this panel proposes to discuss important aspects of the advancement of research communication. Some of these aspects are reflected in the SciELO indexing criteria, as is the case of referencing research materials in favor of transparency and reproducibility.
Syllabus
FAIR criteria, concepts and implementation; challenges for the publication of data and methods; institutional policies for open data; adoption of TOP guidelines (Transparency and Openness Promotion); software repositories; thematic areas data repositories.
ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. It was developed to index available sources of chemical structures and their associated data into a single searchable repository and to make it available to everybody, at no charge. While there are a large number of databases containing chemical compounds and data available online, their inherent quality, accuracy and completeness are severely lacking. ChemSpider has provided a platform so that the chemistry community can contribute to improving the quality of data online and to expanding the information to include data such as reaction syntheses, analytical data, experimental properties and linkages to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources.
This presentation will provide an overview of ChemSpider and its value to chemists as a search tool, as a public repository of information and how it can become one of the primary foundations of internet-based chemistry. I will also discuss the vision for ChemSpider and some of the lofty goals we are setting for the system moving forward.
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021dkNET
Abstract
Good data stewardship is the cornerstone of knowledge, discovery, and innovation in research. The FAIR Data Principles address data creators, stewards, software engineers, publishers, and others to promote maximum use of research data. The principles can be used as a framework for fostering and extending research data services.
This talk will provide an overview of the FAIR principles and the drivers behind their development by a broad community of international stakeholders. We will explore a range of topics related to putting FAIR data into practice, including how and where data can be described, stored, and made discoverable (e.g., data repositories, metadata); methods for identifying and citing data; interoperability of (meta)data; best-practice examples; and tips for enabling data reuse (e.g., data licensing). Practical examples of how FAIR is applied will be provided along the way.
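As a concrete illustration of the "describe and make discoverable" step mentioned above, a minimal machine-readable dataset description can be written as a schema.org Dataset record. This is only a sketch of one common approach, not a mandated format, and every field value below is an invented placeholder:

```python
import json

# A minimal schema.org Dataset record: one common way to make a dataset
# findable by search engines and metadata harvesters. All values are
# invented placeholders, not a real dataset.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example sequencing dataset",
    "description": "Illustrative metadata record for a shared dataset.",
    "identifier": "https://doi.org/10.5555/example",  # persistent identifier (DOI)
    "license": "https://creativecommons.org/licenses/by/4.0/",  # enables reuse
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "datePublished": "2021-01-22",
}

# Serialize as JSON-LD, ready to embed in a landing page or deposit record.
record = json.dumps(dataset, indent=2)
print(record)
```

Including a persistent identifier and an explicit license is what makes such a record useful for the citation and reuse practices discussed in this webinar.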
Presenter: Christopher Erdmann, Engagement, support, and training expert on the NHLBI BioData Catalyst project at University of North Carolina Renaissance Computing Institute
dkNET Webinars Information: https://dknet.org/about/webinar
This presentation was prepared for delivery before the live overview of ChemSpider. The live presentation is captured on video here: http://www.chemspider.com/blog/my-full-presentation-at-drexel-university.html
Complexities in Open Access Discovery InterfacesMichael Habib
“It Isn’t ‘Open’ If You Can’t Find It: New Open Access Discovery Tools that Close the Gap between Readers and Open Content“, Speaker, Charleston Conference – November 9, 2017; Charleston, SC
Abstract: https://2017charlestonconference.sched.com/event/CHqR/it-isnt-open-if-you-cant-find-it-new-open-access-discovery-tools-that-close-the-gap-between-readers-and-open-content
Open PHACTS Explorer demonstration and talk given at SWAT4LS, Edinburgh, 2013. The Explorer is an Ember JS MVC web application used to navigate the Open PHACTS Linked Data Cache without requiring any knowledge of RDF or SPARQL
Online chemistry resources have expanded dramatically in the past few years, with resources such as PubChem, ChEBI, Wikipedia, ChemSpider and many others offering rich resources to scientists seeking data and information. ChemSpider has become one of the primary chemistry portals delivering a heterogeneous mix of Open and Closed data. ChemSpider offers a structure-centric community for collaboration enabling the crowd-sourced deposition and validation of online chemistry data. ChemSpider has also been integrated into the ChemMantis system – the CHEMistry Markup And Nomenclature Transformation Integrated System. This platform facilitates entity extraction of science-related terms using both heuristics and highly curated dictionaries. The resulting documents are marked up to allow viewing of chemical structures linked out to over 200 different data sources via the ChemSpider database.
The ability to query across a chemistry publisher's content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate RSC’s ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures; 2) chemical name conversion procedures using software algorithms and curated dictionaries; 3) semantic markup; and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have utilized to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend the approaches to the mining of data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and further enrich the ChemSpider database for the benefit of the chemistry community.
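The dictionary-based part of chemical name conversion can be sketched in a few lines: a curated name-to-structure dictionary is scanned against document text and each hit is resolved to a structure identifier. The toy dictionary below holds just two well-known compounds and their InChI strings; it illustrates the general lookup approach only, not the actual RSC/ChemSpider pipeline, where a production dictionary would hold millions of curated name-structure pairs.

```python
import re

# Toy curated dictionary mapping trivial chemical names to InChI strings.
NAME_TO_INCHI = {
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "caffeine": "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3",
}

def extract_entities(text):
    """Return (name, inchi) pairs for dictionary terms found in the text."""
    hits = []
    for name, inchi in NAME_TO_INCHI.items():
        # Word-boundary, case-insensitive match, so 'Benzene' is found
        # but 'nitrobenzene' is not wrongly claimed as 'benzene'.
        if re.search(r"\b%s\b" % re.escape(name), text, re.IGNORECASE):
            hits.append((name, inchi))
    return hits

matches = extract_entities("The product was recrystallised from benzene.")
```

Real systems combine such lookups with heuristics and name-to-structure conversion software to cover systematic names absent from any dictionary.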
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...SC CTSI at USC and CHLA
Date: Apr 4, 2018
Speakers: Hyoungjoo Park, PhD candidate, School of Information Studies, University of Wisconsin-Milwaukee, and Dietmar Wolfram, PhD
Overview: It is increasingly common for researchers to make their data freely available. This is often a requirement of funding agencies but also consistent with the principles of open science, according to which all research data should be shared and made available for reuse. Once data is reused, the researchers who have provided access to it should be acknowledged for their contributions, much as authors are recognised for their publications through citation. Hyoungjoo Park and Dietmar Wolfram have studied characteristics of data sharing, reuse, and citation and found that current data citation practices do not yet benefit data sharers, with little or no consistency in their format. More formalised citation practices might encourage more authors to make their data available for reuse.
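To make the "more formalised citation practices" point concrete, a data citation can be assembled from the core elements most citation guidelines agree on: creator, year, title, repository, and a persistent identifier. The formatting function and all values below are a hypothetical illustration, not a prescribed standard:

```python
def format_data_citation(authors, year, title, repository, doi):
    """Assemble a data citation from the core elements most guidelines
    recommend: creator(s), year, title, publisher/repository, and a
    persistent identifier (here a DOI)."""
    author_str = "; ".join(authors)
    return f"{author_str} ({year}). {title} [Data set]. {repository}. https://doi.org/{doi}"

# Hypothetical example values, for illustration only.
citation = format_data_citation(
    authors=["Doe, J.", "Roe, R."],
    year=2018,
    title="Example shared dataset",
    repository="Example Repository",
    doi="10.5555/example",
)
```

A consistent template like this, with the DOI always resolvable, is exactly the kind of formalisation the speakers argue would help data sharers receive credit.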
Workshop - finding and accessing data - Cambridge August 22 2016Fiona Nielsen
Finding and accessing human genomic data for research
University of Cambridge, United Kingdom | Seminar Room G
Monday, 22 August 2016 from 10:00 to 12:00 (BST)
Charlotte, Nadia and Fiona presented an overview of data sources around the world where you can find genomics data for your research and gave examples of the data access application for dbGaP and EGA with specific details relevant for University of Cambridge researchers.
The Diversity of Biomedical Data, Databases and Standards (Research Data Alli...Peter McQuilton
A 10-minute presentation given in Denver (CO) on the 15th September as part of the IG Elixir Bridging Force, WG Biosharing Registry, WG Data Type Registries, and WG Metadata Standards Catalog joint session of the Research Data Alliance 8th Plenary (part of International Data Week).
This presentation covers the proliferation of data, databases, and data standards in biomedicine, and how BioSharing can help inform and educate users on this landscape and relationships between data, databases and data standards.
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the "assets" of data, models, codes, SOPs, and workflows. The "FAIR" (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying cry. Funding agencies expect data (and increasingly software) management, retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
How to make your published data findable, accessible, interoperable and reusablePhoenix Bioinformatics
Seminar Presentation for PMB Department, UC Berkeley for Love Data Week. Subject is how to prepare publications and associated data sets for maximum reuse.
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET
Abstract
In this presentation, Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health, will share the NIH’s vision for a modernized, integrated FAIR biomedical data ecosystem and the strategic roadmap that NIH is following to achieve this vision. Dr. Gregurick will highlight projects being implemented by team members across the NIH’s 27 institutes and centers and will discuss ways that industry, academia, and other communities can help NIH enable a FAIR data ecosystem. Finally, she will weave in how this strategy is being leveraged to address the COVID-19 pandemic.
Presenter: Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health
dkNET Webinar Information: https://dknet.org/about/webinar
Lecture for a course at NTNU, 27th January 2021
CC-BY 4.0 Dag Endresen https://orcid.org/0000-0002-2352-5497
See also http://bit.ly/biodiversityinformatics
https://www.gbif.no/events/2021/lecture-ntnu-gbif.html
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
Laurie Goodman at #aibsdata: Beyond Data Release Mandates - Helping Authors M...GigaScience, BGI Hong Kong
Laurie Goodman at the AIBS Changing Practices in Data Pub workshop: Beyond Data Release Mandates - Helping Authors Make Data Available. 3rd December 2014
Access the webinar: http://goo.gl/p08pTz
These slides were presented in a webinar by Denodo in collaboration with BioStorage Technologies and Indiana Clinical and Translational Sciences Institute and Regenstrief Institute.
BioStorage Technologies, Inc., Indiana Clinical and Translational Sciences Institute, and Regenstrief Institute (CTSI) have joined Denodo to talk about the important role of technological advancements, such as data virtualization, in advancing biospecimen research.
By watching this webinar, you can gain insight into best practices around the integration of biospecimen and research data as well as technology solutions that provide consolidated views and rapid conversions of this data into valuable business insights. You will also learn how data virtualization can assist with the integration of data residing in heterogeneous repositories and can securely deliver aggregated data in real-time.
Data Communities - reusable data in and outside your organization.Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There is a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data-reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start for helping others to reuse your data. I put this in the context of the notion of data communities, which organizations can use to help foster the use of data both internally and externally.
Similar to "Measuring richness: An RCT to quantify the benefits of metadata quality" (Scott Edmunds)
IDW2022: A decade's experience in transparent and interactive publication of ...GigaScience, BGI Hong Kong
Scott Edmunds at International Data Week 2022: A decade's experience in transparent and interactive publication of FAIR data and software via an end-to-end XML publishing platform. 21st June 2022
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR- and WHO-sponsored call for data papers describing datasets on vectors of human diseases, launched in November 2021. Presented at the GBIF webinar on 25th January 2022 and aimed at authors interested in submitting a manuscript to the series.
STM Week: Demonstrating bringing publications to life via an End-to-end XML p...GigaScience, BGI Hong Kong
Scott Edmunds at the STM Week 2020 Digital Publishing seminar on Demonstrating bringing publications to life via an End-to-end XML publishing platform. 2nd December 2020
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...GigaScience, BGI Hong Kong
Scott Edmunds' talk at CODATA2019 on Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment. 19th September 2019 in Beijing
Scott Edmunds talk at IARC: How can we make science more trustworthy and FAIR...GigaScience, BGI Hong Kong
Scott Edmunds talk at IARC, Lyon. How can we make science more trustworthy and FAIR? Principled publishing for more evidence based research. 8th July 2019
PAGAsia19 - The Digitalization of Ruili Botanical Garden Project: Production...GigaScience, BGI Hong Kong
A 3-part talk presented at PAG Asia 2019 in Shenzhen: The Digitalization of Ruili Botanical Garden Project: Production, Curation and Re-Use. Presented by Huan Liu (CNGB), Scott Edmunds (GigaScience) & Stephen Tsui (CUHK). 8th June 2019
Democratising biodiversity and genomics research: open and citizen science to...GigaScience, BGI Hong Kong
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Ricardo Wurmus at #ICG13: Reproducible genomics analysis pipelines with GNU Guix. Presented at the GigaScience Prize Track at the International Conference on Genomics, Shenzhen, 26th October 2018
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their im...GigaScience, BGI Hong Kong
Paul Pavlidis talk at the #ICG13 GigaScience Prize Track: Monitoring changes in the Gene Ontology and their impact on genomic data analysis (GOtrack). Shenzhen, 26th October 2018
Stefan Prost at #ICG13: Genome analyses show strong selection on coloration, ...GigaScience, BGI Hong Kong
Stefan Prost presentation for the #ICG13 GigaScience Prize Track: Genome analyses show strong selection on coloration, morphological and behavioral phenotypes in birds-of-paradise. Shenzhen, 26th October, 2018
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...GigaScience, BGI Hong Kong
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
Reproducible method and benchmarking publishing for the data (and evidence) d...GigaScience, BGI Hong Kong
Scott Edmunds presentation on: Reproducible method and benchmarking publishing for the data (and evidence) driven era. The Silk Road Forensics Conference, Yantai, 18th September 2018
Mary Ann Tuli: What MODs can learn from Journals – a GigaDB curator’s perspec...GigaScience, BGI Hong Kong
Mary Ann Tuli's talk at the International Society of Biocuration meeting: What MODs can learn from Journals – a GigaDB curator's perspective. Shanghai, 9th April 2018
Laurie Goodman: Sharing and Reusing Cell Image Data, ASCB/EMBO 2017 Subgroup ...GigaScience, BGI Hong Kong
Laurie Goodman's pre-prepared slides for the Subgroup S Sharing and Reusing Cell Image Data session at the 2017 ASCB│EMBO meeting in Philadelphia. December 2017
Susanna Sansone's talk at the "Beyond Open" Knowledge Dialogues/Open Data Hong Kong event on research data, hosted at the Hong Kong Innocentre on Monday 20 November 2017.
Jie Zheng at #ICG12: PhenoSpD: an atlas of phenotypic correlations and a mult...GigaScience, BGI Hong Kong
Jie Zheng at the #ICG12 GigaScience Prize Track: PhenoSpD: an atlas of phenotypic correlations and a multiple testing correction for the human phenome. ICG12, Shenzhen, 26th October 2017
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott Edmunds
1. Measuring richness: an RCT to quantify the benefits of metadata quality
Scott Edmunds
DataCite APAC 2020
2. 8 years in numbers:
• 765 papers published by 5,411 authors from 78 countries and 1,575 institutions
• We've published 46 TB of data: 1,914 datasets/DataCite DOIs, 327,658 files
• 5 editors in 5 locations across 16 time zones, with 68+ years of editorial experience
• 3 data curators with 65+ years of experience
• 231 Data Notes
• Our content is used in 20+ patents and policy documents, 1,000+ news articles and blogs, and 21,186+ tweets
• 1 PROSE Award for innovation in multidisciplinary journal publishing
3. Incentivising data sharing through data publication
4. Where data citation was in 2012:
1. Proven utility/potential user base ✔
2. Acceptance/inclusion by journals ✔
3. Data + citation: inclusion in the references ✔
4. Tracking by citation indexes ✗
5. Usage of the metrics by the community… ✗
5. Where is data citation in 2020?
We still need to tell people to #CitetheDOI
6. What we didn't know in 2012: #DataCitationFail
e.g. this paper:
Gioiosa S, Bolis M, Flati T, Massini A, Garattini E, Chillemi G, Fratelli M, Castrignanò T. Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines. GigaScience. 2018 Oct 1;7(10). https://doi.org/10.1093/gigascience/giy062
It cites this GigaDB dataset DOI in its references:
Gioiosa S, Bolis M, Flati T, et al. Supporting data for "Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines." GigaScience Database. 2018. http://dx.doi.org/10.5524/100442
But the paper includes no dataset DOI information in its Crossref metadata (stripped?); see ref 37 in:
https://api.crossref.org/v1/works/doi.org/10.1093/gigascience/giy062
As a result, no citations show up in Event Data:
https://api.datacite.org/events?doi=10.5524/100442
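The Event Data check above can be scripted. A minimal sketch that builds the query URL for a dataset DOI and counts the events in a response; the sample response is hard-coded to mirror the slide's finding, and reading the count from `meta.total` is an assumption about the response shape:

```python
# Sketch: check whether a dataset DOI has any links recorded in DataCite
# Event Data (endpoint as shown on the slide).
import json
from urllib.parse import urlencode

DATACITE_EVENTS = "https://api.datacite.org/events"

def events_query_url(doi):
    """Build the Event Data query URL for a dataset DOI."""
    return DATACITE_EVENTS + "?" + urlencode({"doi": doi})

def count_events(response_json):
    """Count events in an Event Data response (assumes a meta.total field)."""
    payload = json.loads(response_json)
    return payload.get("meta", {}).get("total", len(payload.get("data", [])))

# The GigaDB dataset from the slide, which had no recorded events:
url = events_query_url("10.5524/100442")
empty_response = '{"data": [], "meta": {"total": 0}}'
print(url)                           # → https://api.datacite.org/events?doi=10.5524%2F100442
print(count_events(empty_response))  # → 0
```

In a real check the response would come from an HTTP GET against the built URL; the slide's point is that for this DOI the result set was empty.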
7. Where is metadata in 2020?
1. Focus now on the move from open to FAIR data (metadata for reusability)
2. Data journals helping incentivize best practice
3. Google Dataset Search pushing the value of schema.org (structured metadata for discoverability)
4. Event Data & Scholix pushing the value of non-proprietary (DataCite/Crossref) citation data
5. New indexes, knowledge graphs and tools built upon these richer data sources
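To illustrate point 3, here is a minimal, hypothetical schema.org Dataset record of the kind Google Dataset Search can index, built as JSON-LD; the field values (name, author, date) are illustrative, not the real metadata of any GigaDB record:

```python
# Hypothetical schema.org Dataset description in JSON-LD; values are
# placeholders, only the DOI echoes the one discussed on the earlier slide.
import json

dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Supporting data for an example genomics paper",
    "identifier": "https://doi.org/10.5524/100442",
    "creator": [{"@type": "Person", "name": "Example Author"}],
    "publisher": {"@type": "Organization", "name": "GigaScience Database"},
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "description": "Illustrative structured metadata for discoverability.",
}

# Serialise to the JSON-LD string that would be embedded in a landing page.
print(json.dumps(dataset_jsonld, indent=2))
```

Embedding a block like this in a dataset landing page is what makes the record discoverable to crawlers, independent of any repository-specific API.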
8. Huge potential, but are data producers using/following it?
http://www.metadata2020.org/
9. GigaScience: adding value (work)
Minimal DataCite (discoverability): title, author names, publisher details, release date, resource type, language.
Additional DataCite (reusability + discoverability): ORCID iDs, keywords, funder details, size of dataset, license, description, relationship info.
Dataset-specific (reusability): reporting checklist attributes, location, specimen details, phenotypic info, related accessions.
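As a sketch of the gap between the two DataCite tiers above, here is a hypothetical pair of records with placeholder values; the field names loosely follow the slide, not the exact DataCite schema:

```python
# Contrast a minimal DataCite record with an enriched one; all values are
# placeholders, field names follow the slide's two columns.
minimal = {
    "title": "Example supporting dataset",
    "creators": ["Author, A."],
    "publisher": "GigaScience Database",
    "publicationYear": 2020,
    "resourceType": "Dataset",
    "language": "en",
}

enriched = dict(
    minimal,
    creatorOrcids=["0000-0000-0000-0000"],            # placeholder ORCID iD
    subjects=["genomics", "biodiversity"],            # keywords
    fundingReferences=["Example Funder"],
    sizes=["54 TB"],
    rightsList=["CC0 1.0"],
    descriptions=["Abstract-style description of the dataset."],
    relatedIdentifiers=["10.1093/gigascience/giy062"],  # relationship info
)

# The enrichment work is exactly the fields the minimal record lacks:
print(sorted(set(enriched) - set(minimal)))
```

The point of the slide is that each extra field is curation effort, which is what the RCT on the following slides tries to justify.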
11. Is this worth the effort?
Follow the medical community approach: a Randomized Control Trial?
Pyramid of evidence (top to bottom): RCTs; cohort studies; case-control studies; case reports, qualitative research.
12. Finding an example to study: DRBG, the "Digitization of Ruili Botanical Garden"
• 1st phase, proof of concept for 10KP
• The 1st digitalized botanical garden
• Shows the biodiversity and phyletic evolution, and interactions between environment, ecosystem and evolution
• HT species identification & building the CNGB Herbarium
• Results of phase 1 published in GigaScience
1,093 samples; 1,093 voucher specimens; 49 orders; 137 families; 761 deep-sequenced; 689 vascular species; 54 TB of data
16. Does rich metadata increase discoverability? Testing with RCT
https://osf.io/wzps8/
17. Does rich metadata increase discoverability? Testing with RCT
https://osf.io/wzps8/
Cohorts (assigned at random with =RANDBETWEEN):
High data content (HDC set, n=17):
• HDC1 – high data content, full DataCite metadata, n=8
• HDC2 – high data content, minimal DataCite metadata, n=9
Low data content (LDC set, n=1,076):
• LDC1 – low data content, full DataCite metadata, n=545
• LDC2 – low data content, minimal DataCite metadata, n=531
Rich metadata set vs poor metadata set; wait 12 months.
Any difference in metrics? (visits, downloads, citations…)
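The randomisation step can be sketched in code; the slide used Excel's =RANDBETWEEN, and here Python's `random` module plays the same role. The dataset IDs and seed are illustrative, only the cohort size (n=17 for the high-data-content set) comes from the slide:

```python
# Sketch of randomising datasets into rich- vs minimal-metadata arms,
# mirroring the =RANDBETWEEN step on the slide (IDs and seed are made up).
import random

def randomise(dataset_ids, seed=None):
    """Split dataset IDs into rich- and minimal-metadata arms at random."""
    rng = random.Random(seed)
    rich, minimal = [], []
    for ds in dataset_ids:
        # Per-dataset coin flip, equivalent to RANDBETWEEN(0, 1) in Excel.
        (rich if rng.randint(0, 1) else minimal).append(ds)
    return rich, minimal

# High-data-content cohort: n=17 on the slide (8 ended up rich, 9 minimal).
hdc = [f"dataset-{i:03d}" for i in range(17)]
rich, minimal = randomise(hdc, seed=42)
assert len(rich) + len(minimal) == len(hdc)
assert set(rich).isdisjoint(minimal)
```

A per-dataset coin flip (rather than a fixed 50/50 split) explains why the arms on the slide are slightly unequal, e.g. 545 vs 531 in the low-data-content cohort.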
18. Does rich metadata increase discoverability? Testing with RCT
Any difference in metrics? (visits, downloads, citations…)
• The total number of unique page views for ALL 1,093 individual Ruili datasets is 504 over the year (0.46 views per dataset)
• Equivalent datasets (individual genomes from bird & orphan crop genome projects) that are NOT Ruili datasets received 4,473 unique page hits over the same period (44.7/dataset)
• Rich-metadata datasets received on average 0.438 hits/dataset/year
• Poor-metadata datasets received on average 0.485 hits/dataset/year
✗ FAIL: didn't work/underpowered (very low access stats)
19. Does rich metadata increase discoverability? Testing with RCT
✗ FAIL. Lessons learned for future RCTs:
• Unidentified species are not a great use case for discoverability
• A quick-and-dirty approach to an RCT doesn't work: need a wider spectrum of more popular datasets and a bigger sample size
• Comparing historical usage is tricky; comparison groups need better matching, and datasets ideally need to be released at the same time to account for calendar differences and usage spikes
• Need to test with databases with higher access/turnover, which could be done with hundreds of random generic datasets published within a short timeframe and randomly assigned into minimal vs enhanced metadata groups
• Crossref RCTs would probably work better than DataCite ones (more users)
20. Does rich metadata increase discoverability? Testing with RCT
See our experiment: https://osf.io/wzps8/
TO DO: METADATA EXPERIMENTS
21. Thanks to:
Laurie Goodman, Publisher
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Head of IT
Chris Hunter, Lead BioCurator
Chris Armit, Data Scientist
Mary Ann Tuli, Data Editor
Rija Ménagé, Senior Software Engineer
Ken Cho, Systems Programmer Analyst
Chen Qi, Shenzhen Office.
Jesse Xiao (now at HKU)
Follow us:
https://gigabytejournal.com/
Submit to our new GigaByte Journal, free APCs till 28th Feb 2021
@GigaByteJournal
facebook.com/GigaByteJournal
http://gigasciencejournal.com/blog/
editorial@gigabytejournal.com
Editor's Notes
Includes sample metadata (in database only, not DataCite) and cross-species results (gene alignments & trees)