Data availability Study

Primary data collected during a research study is increasingly shared and may be re-used for new research. The aim of this project was to assess how often summary statistics from primary human genome-wide association studies (GWAS) are shared, as an example of data sharing under favourable circumstances in a particular discipline, and whether such checks can be automated. This presentation summarises the findings of the project and demonstrates a tool to extract information from data availability statements.

Published in: Data & Analytics

  1. Data availability and feasibility of validation – A genomics case study
     Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma Stuart, Meiko Makita, Verena Weigert, Chris Keene, Nushrat Khan, Katie Drax, Kayvan Kousha
     University of Wolverhampton, University of Bristol & UK Reproducibility Network & JISC
  2. Data sharing experiment goals
     • Find out how often data is shared in a field with apparently ideal conditions
     • Write a program to automatically identify shared data of a specified type
     • Write a program to validate the quality of shared data of a specified type
     • As a step towards more general automatic shared data discovery and quality control
  3. The ideal case study topic? GWAS
     • Genome Wide Association Study (GWAS) summary statistics
     • Variation likelihood at large sets of locations of the human genome for measurable traits (e.g. disease susceptibility)
     • Data is high value and expensive to collect
     • Often stored in a standard format for internal sharing by consortia
     • An international repository exists for hosting it, emphasising its importance
       • NHGRI-EBI Catalog of published genome-wide association studies
     • Meta-analyses benefit from shared files – increased power and population triangulation
     • Genomics has a reputation for data sharing
  4. https://www.ebi.ac.uk/gwas/diagram
     Each dot represents a point on the human genome that at least one research study has found to be associated with a measurable trait
  5. Methods
     • Medline search for articles that could be primary human GWAS: "Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
     • Restricted to articles from 2010 and 2017 to identify trends
     • Three human coders classified 1799 articles for being (a) primary human GWAS and (b) publicly sharing complete primary human GWAS summary statistics
     • Follow-up checks of the results by MT and MM
     https://www.biorxiv.org/content/10.1101/622795v1
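The search on the Methods slide could be reproduced programmatically along these lines. This is a minimal sketch, not the authors' procedure: it assumes the NCBI E-utilities `esearch` endpoint is used and that the year restriction is expressed with the `[dp]` (date of publication) field tag. The function name `build_search_url` is hypothetical, and the code only builds the URLs without sending any request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed access route, not stated on the slide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Topic query as given on the Methods slide.
TOPIC_QUERY = '"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]'


def build_search_url(year: int, retmax: int = 100) -> str:
    """Return an esearch URL for the topic query limited to one publication year."""
    # Assumption: the year limit is applied with the [dp] field tag.
    term = f"({TOPIC_QUERY}) AND {year}[dp]"
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# One URL per study year.
for year in (2010, 2017):
    print(build_search_url(year))
```

Running one query per year keeps the 2010 and 2017 result sets separate, which matches the trend comparison described on the slide.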
  6. Results
     Data availability information                     2010  2017  Total  Percent
     GWAS location not stated in article                156   139    295    89.4%
     Broken link or not findable at stated location       3     1      4     1.2%
     On request to the authors                            0     8      8     2.4%
     On request via dbGaP                                 2     5      7     2.1%
     On request via EGA                                   1     3      4     1.2%
     On request via another portal                        0     3      3     0.9%
     Free online without login, proprietary format        1     0      1     0.3%
     Free online without login, plain text                0     8      8     2.4%
     10.6% reported sharing GWAS summary statistics in some form
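The percentages in the Results table can be rechecked from the counts alone. The sketch below copies the 2010 and 2017 counts from the table; the variable names are illustrative, and the treatment of every category except "not stated" as "sharing in some form" is an assumption that reproduces the slide's 10.6% figure.

```python
# (2010 count, 2017 count) per availability category, copied from the table.
counts = {
    "GWAS location not stated in article": (156, 139),
    "Broken link or not findable at stated location": (3, 1),
    "On request to the authors": (0, 8),
    "On request via dbGaP": (2, 5),
    "On request via EGA": (1, 3),
    "On request via another portal": (0, 3),
    "Free online without login, proprietary format": (1, 0),
    "Free online without login, plain text": (0, 8),
}

# Total number of articles across both years and all categories.
total = sum(y2010 + y2017 for y2010, y2017 in counts.values())

# Per-category totals and percentages of all articles.
for label, (y2010, y2017) in counts.items():
    row_total = y2010 + y2017
    print(f"{label:<48}{row_total:>5}{100 * row_total / total:>8.1f}%")

# Assumption: every category other than "not stated" counts as reported sharing.
shared = total - sum(counts["GWAS location not stated in article"])
share_pct = 100 * shared / total
print(f"Reported sharing in some form: {shared} of {total} ({share_pct:.1f}%)")
```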
  7. Article descriptions of the availability of GWAS summary statistics
     • Usually in a Data Availability article section (26 out of 35).
     • Data availability more difficult to identify from the methods (4 articles) and results (3 articles).
     • Only five data sharing statements described the shared data as GWAS summary statistics, and all five used different phrases
       • “full GWAS summary statistics”, “Case Oncoarray GWAS data”, “Summary GWAS estimates”, “Summary statistics for the genome-wide association study”, “genome-wide set of summary association statistics”
     • Descriptions are therefore hard to automatically identify from articles.
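The effect of this terminology variation on automatic matching can be illustrated with the five phrases quoted on the slide. In the sketch below, the exact-phrase pattern and the looser keyword rule are both illustrative choices rather than the project's method, and the false-positive sentence is invented for the example.

```python
import re

# The five descriptions quoted on the slide.
phrases = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Naive approach: look for the canonical term as an exact phrase.
naive = re.compile(r"GWAS summary statistics", re.IGNORECASE)
naive_hits = [p for p in phrases if naive.search(p)]


def loose_match(text: str) -> bool:
    """Looser rule: a GWAS term and a data term anywhere in the text."""
    lowered = text.lower()
    has_gwas = "gwas" in lowered or "genome-wide" in lowered
    has_data = any(w in lowered for w in ("summary", "statistics", "estimates", "data"))
    return has_gwas and has_data


loose_hits = [p for p in phrases if loose_match(p)]

# Invented sentence about a different data type, used to show the cost of the loose rule.
false_positive = "Individual-level GWAS genotype data are available on request."

print(f"Exact phrase matches {len(naive_hits)} of {len(phrases)} descriptions")
print(f"Loose rule matches {len(loose_hits)} of {len(phrases)} descriptions")
print(f"Loose rule also matches unrelated sentence: {loose_match(false_positive)}")
```

The exact phrase finds only one of the five descriptions, while the looser rule finds all five but also accepts a sentence that is not about summary statistics.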
  8. Conclusions
     • Data sharing is unlikely to become near-universal when it is optional.
     • Policy initiatives or mandates are needed to promote data sharing.
     • Automatically identifying shared data is difficult or impossible in practice because of:
       • the complexity of articles (multiple data sources and article structures)
       • a lack of standardisation of terminology
     • However, data availability statements help
  9. Follow-up study: Investigating data availability statements
     • A program was written to extract data sharing statements from full text articles in XML
     • Free software Webometric Analyst (http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full text > Data availability statements extract
     • Manual content analysis for types of information in extracted PMC Open Access Subset data availability statements (n=500)
     • Test machine learning for classifying data sharing methods from data availability statements
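Extracting a data availability statement from full-text XML might look roughly like the sketch below. It is not the Webometric Analyst implementation. It assumes JATS-style markup in which the statement appears either as a `custom-meta` entry named "Data Availability" or as a `sec` element that is typed or titled as data availability. The sample article and the function name `extract_statements` are made up for illustration; only the first statement's wording comes from the slides.

```python
import xml.etree.ElementTree as ET

# Invented miniature article in JATS-style XML (assumed markup, for illustration only).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <custom-meta-group>
        <custom-meta id="data-availability">
          <meta-name>Data Availability</meta-name>
          <meta-value>All relevant data are within the paper.</meta-value>
        </custom-meta>
      </custom-meta-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Methods</title><p>We analysed published articles.</p></sec>
  </body>
  <back>
    <sec sec-type="data-availability">
      <title>Data availability</title>
      <p>Summary statistics are available from the authors on request.</p>
    </sec>
  </back>
</article>"""


def clean(element: ET.Element) -> str:
    """Join all nested text of an element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_statements(xml_text: str) -> list[str]:
    """Return data availability statements found in metadata or sections."""
    root = ET.fromstring(xml_text)
    statements = []
    # Case 1: statement stored as article metadata.
    for meta in root.iter("custom-meta"):
        name = (meta.findtext("meta-name") or "").strip().lower()
        value = meta.find("meta-value")
        if "data availability" in name and value is not None:
            statements.append(clean(value))
    # Case 2: statement stored as a dedicated section.
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").strip().lower()
        if sec.get("sec-type") == "data-availability" or "data availability" in title:
            statements.append(" ".join(clean(p) for p in sec.findall("p")))
    return statements


for statement in extract_statements(SAMPLE_XML):
    print(statement)
```

Checking both the metadata and the section form reflects the point made earlier in the deck that article structures vary; real articles may use further variants that this sketch does not cover.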
  10. Results – how is data shared?
      Almost all papers with a data sharing statement (D.S.S.) claim to share data. Standardised wordings are common, e.g. “All relevant data are within the paper.”
  11. Results – what data is shared?
      38% of data sharing statements specify that all data is shared
  12. Results – why is data [not] shared?
      91% of data sharing statements give no explanation for their data sharing policy
  13. Machine learning
      • Simple support vector machine (SVM) test for detecting sharing methods from data sharing statements
      • 87% accurate for: how is the data shared?
      • 89% accurate for: is all the data shared? (binary)
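A classifier of this kind can be sketched with scikit-learn as below. This is an assumption-laden illustration, not the study's setup: the slides do not say which features, library, or label scheme were used, so TF-IDF word features, `LinearSVC`, and the three labels are all choices made here. The training statements are invented apart from the standard wording quoted on an earlier slide, and the set is far too small to say anything about accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training statements (invented for illustration) with hypothetical labels.
train_texts = [
    "All relevant data are within the paper.",
    "All relevant data are within the paper and its supporting information files.",
    "The data are deposited in a public repository.",
    "Data are available from a public repository under an accession number.",
    "Data are available from the authors on request.",
    "The dataset is available on request from the corresponding author.",
]
train_labels = [
    "in_paper",
    "in_paper",
    "repository",
    "repository",
    "on_request",
    "on_request",
]

# TF-IDF over single words and word pairs feeding a linear SVM (assumed design).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

# Classify a new, unseen statement.
new_statement = "Data can be obtained from the corresponding author on request."
prediction = model.predict([new_statement])[0]
print(prediction)
```

In practice the labelled statements would come from the manual content analysis, with accuracy estimated on held-out statements rather than on the training set.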
  14. Summary
      • Data sharing seems to need mandates to become widespread, even in otherwise best case fields
      • Shared data is hard to detect precisely because of article complexity and language variation.
      • Basic information about whether data is shared and where can be extracted automatically from data availability statements.
