Data availability and feasibility of validation – A genomics case study

Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
The ideal case study topic? GWAS
• Genome-Wide Association Study (GWAS) summary
statistics
• Variation likelihood at large sets of locations of the human
genome for measurable traits (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
https://www.ebi.ac.uk/gwas/diagram
Each dot represents a point on the human genome that at least one
research study has found to associate with a measurable trait
Methods
• Medline search for articles that could be primary
human GWAS:
"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles by
whether they (a) were primary human GWAS and
(b) publicly shared complete primary human GWAS
summary statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
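The search above could be reproduced programmatically along the following lines. This is only a sketch: it assumes the NCBI E-utilities ESearch endpoint for PubMed and the `[dp]` (date of publication) field tag as the year restriction, neither of which is stated on the slide, and the function name `build_search_url` is hypothetical. It only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint is used to query PubMed/Medline.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string as given on the slide.
MESH_QUERY = '"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]'


def build_search_url(year: int, retmax: int = 100) -> str:
    """Build an ESearch URL restricting the slide's query to one publication year."""
    # Assumption: [dp] (date of publication) is how the year restriction is expressed.
    term = f"({MESH_QUERY}) AND {year}[dp]"
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# One URL per sampled year, matching the 2010 and 2017 restriction.
urls = {year: build_search_url(year) for year in (2010, 2017)}
for year, url in urls.items():
    print(year, url)
```

Sampling two separate years, rather than a date range, is what allows the 2010 and 2017 columns in the results to be compared as a trend.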
Results
Data availability information                   2010  2017  Total  Percent
GWAS location not stated in article              156   139    295    89.4%
Broken link or not findable at stated location     3     1      4     1.2%
On request to the authors                          0     8      8     2.4%
On request via dbGaP                               2     5      7     2.1%
On request via EGA                                 1     3      4     1.2%
On request via another portal                      0     3      3     0.9%
Free online without login, proprietary format      1     0      1     0.3%
Free online without login, plain text              0     8      8     2.4%
10.6% reported sharing GWAS summary statistics in some form
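The percentages in the table can be checked from the raw counts. The short sketch below recomputes them, assuming the denominator is the sum of all rows (330 articles); the variable names are illustrative only.

```python
# Counts per category as (2010, 2017), copied from the results table.
counts = {
    "GWAS location not stated in article": (156, 139),
    "Broken link or not findable at stated location": (3, 1),
    "On request to the authors": (0, 8),
    "On request via dbGaP": (2, 5),
    "On request via EGA": (1, 3),
    "On request via another portal": (0, 3),
    "Free online without login, proprietary format": (1, 0),
    "Free online without login, plain text": (0, 8),
}

totals = {label: y2010 + y2017 for label, (y2010, y2017) in counts.items()}
n_articles = sum(totals.values())

# Percentage of all articles falling in each category.
percent = {label: round(100 * t / n_articles, 1) for label, t in totals.items()}

# Articles that reported sharing in some form = everything except "not stated".
shared_any_form = n_articles - totals["GWAS location not stated in article"]
shared_percent = round(100 * shared_any_form / n_articles, 1)

print(n_articles, shared_any_form, shared_percent)  # 330 35 10.6
```

Note that the 35 articles counted as sharing in some form include the four whose links were broken or not findable, so the share of data that was actually retrievable is lower than 10.6%.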
Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
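A small illustration of the terminology problem, using the five phrases quoted above. The two matching rules are assumptions made for this sketch, not the method used in the study.

```python
import re

# The five descriptions quoted on the slide.
phrases = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Rule 1 (illustrative): look for the canonical term only.
exact = re.compile(r"GWAS summary statistics", re.IGNORECASE)

# Rule 2 (illustrative): a GWAS-like term plus a data-like term anywhere in the phrase.
gwas_term = re.compile(r"GWAS|genome-wide", re.IGNORECASE)
data_term = re.compile(r"summary|statistics|data|estimates", re.IGNORECASE)

exact_hits = [p for p in phrases if exact.search(p)]
loose_hits = [p for p in phrases if gwas_term.search(p) and data_term.search(p)]

print(len(exact_hits), "of", len(phrases), "matched by the exact term")
print(len(loose_hits), "of", len(phrases), "matched by the loose rule")
```

The exact term finds only one of the five phrases. The loose rule finds all five, but in a full article it would also match many sentences that merely mention GWAS and data together, which is the precision problem described above.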
Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
• the complexity of articles (multiple data sources and
article structures)
• a lack of standardisation of terminology
• However, data availability statements help
Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
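The extraction step might look roughly like the sketch below. It is not the Webometric Analyst implementation. It assumes JATS-style article XML in which the statement sits in a `<sec>` element that either carries `sec-type="data-availability"` or has a title beginning "Data availability". Real PMC files vary (some publishers put the statement in metadata instead), and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, minimal JATS-like article used only to exercise the function.
SAMPLE = """<article>
  <body>
    <sec><title>Methods</title><p>We ran a GWAS.</p></sec>
  </body>
  <back>
    <sec sec-type="data-availability">
      <title>Data Availability</title>
      <p>All relevant data are within the paper.</p>
    </sec>
  </back>
</article>"""


def extract_statements(xml_text):
    """Return the text of every section that looks like a data availability statement."""
    root = ET.fromstring(xml_text)
    found = []
    for sec in root.iter("sec"):
        sec_type = (sec.get("sec-type") or "").lower()
        title = (sec.findtext("title") or "").strip().lower()
        if "data-availability" in sec_type or title.startswith("data availability"):
            # Join paragraph text and collapse whitespace, keeping inline markup text.
            paragraphs = [" ".join("".join(p.itertext()).split()) for p in sec.findall("p")]
            found.append(" ".join(paragraphs))
    return found


print(extract_statements(SAMPLE))
```

Checking both the attribute and the title is a deliberate redundancy, since either one may be missing in a given publisher's markup.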
Results – how is data shared?
Almost all papers with a data sharing statement
(D.S.S.) claim to share data.
Standardised wordings are common,
e.g., “All relevant data are within
the paper.”
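One simple way to surface standardised wordings is to normalise each statement and count repeats. In the sketch below only the first statement is quoted from the slide; the others are hypothetical variants added for illustration, and the normalisation rule is an assumption.

```python
import re
from collections import Counter

# First item is quoted on the slide; the rest are hypothetical examples.
statements = [
    "All relevant data are within the paper.",
    "All relevant data are within the paper",
    "all relevant data are within the paper and its Supporting Information files.",
    "Data are available on request from the authors.",
    "All  relevant data are within the paper.",
]


def normalise(text):
    """Lower-case, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


wording_counts = Counter(normalise(s) for s in statements)
most_common, n = wording_counts.most_common(1)[0]
print(n, "statements share the wording:", most_common)
```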
Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: how is the data shared?
• 89% accurate for: is all the data shared? (binary)
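To make the classification idea concrete, here is a minimal linear SVM trained by subgradient descent on the hinge loss over bag-of-words features. It is a sketch under several assumptions: the four training statements and their labels are invented, the hyperparameters are arbitrary, and it does not reproduce the study's setup or its accuracy figures. In practice a library implementation would normally be used instead of hand-written training code.

```python
def tokenize(text):
    """Lower-case and split on whitespace, treating full stops as separators."""
    return text.lower().replace(".", " ").split()


def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def train_linear_svm(X, y, lam=0.01, eta=0.02, epochs=200):
    """Full-batch subgradient descent on (lam/2)*||w||^2 + mean hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b


def predict(text, vocab, w, b):
    x = featurize(text, vocab)
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical labels: +1 = data within the paper, -1 = available on request.
train = [
    ("All relevant data are within the paper.", 1),
    ("All data are included in the manuscript.", 1),
    ("Data are available on request from the authors.", -1),
    ("Data available upon request.", -1),
]

vocab = {}
for text, _ in train:
    for tok in tokenize(text):
        vocab.setdefault(tok, len(vocab))

X = [featurize(text, vocab) for text, _ in train]
y = [label for _, label in train]
w, b = train_linear_svm(X, y)
predictions = [predict(text, vocab, w, b) for text, _ in train]
print(predictions)
```

Standardised wordings make this kind of classifier workable with simple features, because a handful of distinctive words ("within", "request") carry most of the signal.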
Software to detect data sharing
• Webometric Analyst (free: http://lexiurl.wlv.ac.uk/)
tool to extract data sharing statements from a
folder of PDFs and classify them
• http://lexiurl.wlv.ac.uk/searcher/datashare.html
• Needs standard format for these statements
• Disciplinary & publisher differences in the uptake of data
sharing statements
Webometric Analyst output
• Attempts to classify what is shared, how (where),
and why
Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.
• Applications: Monitoring; More useful in the longer
term after standardisation?

Editor's Notes

  1. “A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism