GWAS and DAS

Mining Data Availability Statements for GWAS data, presentation by Jo McEntyre, EMBL-EBI

Published in: Data & Analytics
  1. Mining Data Availability Statements for GWAS data. Jo McEntyre, EMBL-EBI
  2. GWAS and the GWAS Catalog
     • GWAS analyse variants across the genome to identify loci associated with a disease or phenotype
     • GWAS Catalog data:
       - Study metadata, including trait and sample information
       - Publication information
       - Results: lead associations and summary statistics
  3. GWAS Catalog content, as of October 2019 (www.ebi.ac.uk/gwas)
     • 4,220 publications
     • 7,661 studies
     • 157,336 variant-trait associations
     • 276 publications with summary statistics, >8,000 datasets
  4. What is Europe PMC? Europe PMC is a free digital archive of biomedical and life sciences research publications
  5. Content in Europe PMC: Europe PMC is a partner in PubMed Central International
  6. Text mining infrastructure
     • Gene-disease relationships
     • Mutations
     • GeneRIFs
     • Diseases and phenotypes
     • Phosphorylation events
     • Transcription factor-target interactions
     • Organisms
     • Genes/proteins
     • GO terms
     • ChEBI
     • EFO
     • Grants
     • Accession numbers
  7. Text mining platform: SciLite application
  8. Accession numbers mined from full-text publications: ELIXIR Core Data Resources and Deposition Databases
  9. Cross-links between GWAS and Europe PMC
  10. Data availability statements in Europe PMC
  11. <title> and XML path: top 10 combinations of <title> content containing “data” and XML path

      Title                            XML path                       Frequency
      Data Availability                article:front:notes               90,928
      Data accessibility               article:back:sec                   2,694
      Data Availability                article:back:sec:fn-group          2,580
      Data                             article:body:sec                   2,265
      Availability of supporting data  article:body:sec                   1,593
      Major datasets                   article:back:sec:sec               1,074
      Database survey                  article:body:sec                     986
      Extended Data                    article:body:sec                     851
      Data availability                article:body:sec                     795
      Extended Data Figure 1           article:body:sec:SecTag:fig          689
  12. Some unhelpful statements
  13. Curating papers for the GWAS Catalog
  14. GWAS Catalog literature identification: query-based vs machine learning

                   Query-based   Machine learning
      Precision    6%            27%
      Recall       100%          96%

      Improved efficiency: 80% reduction in publications to review, from an average of 144 to 30 per week
  15. Summary statistics in the GWAS Catalog by publication year: % of publications with summary statistics over time and in the whole Catalog
  16. Summary statistics for users: facilitating data integration and downstream analyses
  17. The end
  18. GWAS Catalog literature identification
      • Previously: a manual query-based search
        - Query: genomewide OR genome wide OR genome-wide OR GWAS
      • Now replaced with a machine-learning-based search
        - Convolutional neural net trained on a corpus of GWAS Catalog publications
        - Collaboration with Zhiyong Lu’s group (Lee et al, PMID 30102703, PLoS Comp Bio)
      • ML results are triaged by a curator in a custom PubTator interface
  19. Old literature search and triage process
      • Manual search in PubMed
      • Query: genomewide OR genome wide OR genome-wide OR GWAS
      • Curator assesses each publication for eligibility for inclusion in the GWAS Catalog
      • Specific eligibility criteria: https://www.ebi.ac.uk/gwas/docs/methods/criteria
      • Genome wide association study of >100,000 variants distributed genome
  20. Machine learning search: deep learning algorithm (convolutional neural net) trained on a corpus of GWAS Catalog publications. Figure 1, Lee et al, PMID 30102703, PLoS Comp Bio
  21. GWAS Catalog machine learning literature search method (Table 3, Lee et al, PMID 30102703, PLoS Comp Bio)
      • Precision 27%
      • Recall 96%
  22. GWAS Catalog machine learning literature search method vs query-based search (Table 3, Lee et al, PMID 30102703, PLoS Comp Bio)
      • Machine learning improved efficiency: 80% reduction in publications to review, from 144 to 30 per week
      • Similar capture of eligible studies
  23. Uses
      • Narrow down/prioritise candidate loci
      • Drug target discovery
      • Predict disease risk
      • Understand disease mechanism
      • Statistics on disease data and research
  24. DOI citations within DASs: most popular data repositories based on DOI citations in DASs (Jan-Mar 2019). Pattern used to capture the DOI prefix: (?i)(10[.]\d{4,9})(?=/)(?=[-._;()/:A-Z0-9]+)
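
The regular expression on slide 24 can be tried out with a short Python sketch. It assumes that the `d` in the transcript stands for the digit class `\d` (the backslash appears to have been lost in extraction), and that repositories are ranked by counting the captured DOI prefixes. The function names and the example statements are hypothetical and are not taken from the talk; the DOI suffixes are invented.

```python
import re
from collections import Counter

# Pattern from slide 24, with the digit class written as \d (assumption).
# It captures only the DOI prefix (e.g. "10.5281"); the two lookaheads
# require that a "/" and at least one DOI-suffix character follow.
DOI_PREFIX = re.compile(r"(?i)(10[.]\d{4,9})(?=/)(?=[-._;()/:A-Z0-9]+)")


def doi_prefixes(statement):
    """Return every DOI prefix cited in one data availability statement."""
    return DOI_PREFIX.findall(statement)


def count_prefixes(statements):
    """Tally DOI prefixes across many statements (hypothetical helper)."""
    counts = Counter()
    for statement in statements:
        counts.update(doi_prefixes(statement))
    return counts


# Invented statements, for illustration only.
examples = [
    "Data are available from Dryad: https://doi.org/10.5061/dryad.abc12.",
    "Summary statistics: doi:10.5281/zenodo.1234567 and doi:10.5281/zenodo.7654321.",
    "Data are available on request.",
]

if __name__ == "__main__":
    for prefix, n in count_prefixes(examples).most_common():
        print(prefix, n)
```

Because a DOI prefix identifies the registrant, grouping by prefix is one plausible way to arrive at a per-repository ranking; mapping each prefix to a repository name would need a separate lookup that is not shown here.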

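The title and XML path counts on slide 11 suggest a tally of `<title>` elements whose text contains "data", keyed by the chain of ancestor tags. A minimal sketch of such a tally follows, assuming the path is the ancestor tag names joined with ":" (inferred from the slide, not confirmed by it). The function name and the sample article are hypothetical, and XML namespaces are not handled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented, heavily simplified article XML for illustration.
SAMPLE = """<article>
  <front><notes><title>Data Availability</title><p>All data are in the paper.</p></notes></front>
  <body><sec><title>Methods</title><p>Text.</p></sec></body>
  <back><sec><title>Data accessibility</title><p>Deposited in a repository.</p></sec></back>
</article>"""


def title_paths(xml_text, keyword="data"):
    """Count (title text, ancestor path) pairs for titles containing keyword."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, ancestors):
        path = ancestors + [elem.tag]
        for child in elem:
            if child.tag == "title":
                text = "".join(child.itertext()).strip()
                if keyword in text.lower():
                    # The path ends at the element that contains the title.
                    counts[(text, ":".join(path))] += 1
            else:
                walk(child, path)

    walk(root, [])
    return counts


if __name__ == "__main__":
    for (title, path), n in title_paths(SAMPLE).items():
        print(title, path, n)
```

Running the same function over a whole corpus and summing the counters would give a frequency table of the kind shown on the slide.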
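The precision, recall and workload figures on slides 14, 21 and 22 follow the standard definitions, sketched below. The true-positive, false-positive and false-negative counts are hypothetical values chosen only to reproduce the reported percentages; only the weekly review load (144 and 30 publications) comes from the slides.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved publications that are eligible."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of eligible publications that are retrieved."""
    return true_pos / (true_pos + false_neg)


def reduction(before, after):
    """Relative drop in the number of publications to review."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical counts: 27 eligible among 100 retrieved, 1 eligible missed.
    print(f"precision: {precision(27, 73):.0%}")
    print(f"recall:    {recall(27, 1):.0%}")
    # Weekly review load reported on the slides: 144 before, 30 after.
    print(f"reduction: {reduction(144, 30):.1%}")
```

The drop from 144 to 30 publications per week works out to about 79%, which the slides round to 80%.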