Data availability and feasibility of validation – A genomics case study
1. Data availability and
feasibility of validation –
A genomics case study
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
2. Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
3. The ideal case study topic? GWAS
• Genome Wide Association Study (GWAS) summary
statistics
• Variation likelihood at large sets of locations of the human
genome for measurable traits (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
5. Methods
• Medline search for articles that could be primary
human GWAS
"Molecular Epidemiology"[Majr] AND "Genome-
Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles for
being (a) primary human GWAS and (b) publicly
sharing complete primary human GWAS summary
statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
6. Results
Data availability information 2010 2017 Total Percent
GWAS location not stated in article 156 139 295 89.4%
Broken link or not findable at stated location 3 1 4 1.2%
On request to the authors 0 8 8 2.4%
On request via dbGaP 2 5 7 2.1%
On request via EGA 1 3 4 1.2%
On request via another portal 0 3 3 0.9%
Free online without login, proprietary format 1 0 1 0.3%
Free online without login, plain text 0 8 8 2.4%
10.6% reported sharing GWAS summary statistics in some form
7. Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
8. Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
• the complexity of articles (multiple data sources and
article structures)
• a lack of standardisation of terminology
• - but data availability statements help
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma Stuart, Meiko Makita, Verena
Weigert, Chris Keene, Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK Reproducibility Network & JISC
9. Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
10. Result - how is data shared?
Almost all papers with D.S.S. claim
to share data.
Standardised wordings common
e.g., “All relevant data are within
the paper.”
11. Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
12. Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
13. Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: How is the data shared
• 89% accurate for: is all the data shared (binary)
14. Software to detect data sharing
• Webometric Analyst (free: http://lexiurl.wlv.ac.uk/)
tool to extract data sharing statements from a
folder of PDFs and classify them
• http://lexiurl.wlv.ac.uk/searcher/datashare.html
• Needs standard format for these statements
• Disciplinary & publisher differences in the uptake of data
sharing statements
16. Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.
• Applications: Monitoring; More useful in the longer
term after standardisation?
• Mike Thelwall, Marcus Munafò, Amalia Mas Bleda,
Emma Stuart, Meiko Makita, Verena Weigert, Chris
Keene, Nushrat Khan, Katie Drax, Kayvan Kousha
• University of Wolverhampton, University of Bristol
& UK Reproducibility Network & JISC
Editor's Notes
“A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism