Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

OpenAIREplus workshop - “Linking Open Access publications to data – policy development and implementation” (June 11, 2012)

Published in: Technology, Health & Medicine

  1. 1. Linking literature to data in the life sciences. OpenAIREplus workshop, Copenhagen, 11 June 2012
  2. 2. Overview • What literature? What data? • How we make literature-data connections • Case study • Challenges and future directions
  3. 3. What literature? What data?
  4. 4. Data Landscape and Definitions Research articles Funder mandates Journal requirements Metadata Standards Big Data: Deposition Primary Unstructured Data *reuse Big Data: Curated Annotation
  5. 5. PMC336623 Extended to several other biological data types
  6. 6. • Big data • Thematic data • Public data • Archived data • Two petabytes of data • Scales to 7 PB raw disk • Majority is DNA [Charts, by year, 2001 to 2010: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures)]
  7. 7. Two core literature databases. Abstracts: • 26 million abstracts (PubMed, Patents, Agricola) • Website and web services (citation networks, database links, Whatizit text mining) • over 1.1 million new records per year. Full text: • 2.2 million full text articles (217K articles with suppl data) • Website (supplemented by CiteXplore, additional text mining) • over 150K new articles per year
  8. 8. UK PubMed Central Overview • Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006 • Led by the European Bioinformatics Institute since 2011, with the British Library, and the University of Manchester • Supported by 16 UK and 2 European Funders, led by the Wellcome Trust. Research spend: ~ 2 billion GBP • A life-science web-based repository • Manuscript submission service (self archiving by grant holders) • Database of grant information – with details of about 18000 PIs • Grant reporting and funder analysis tool • 250K requests, 40K IPs, 7K direct interactive searches per day
  9. 9. How many articles? Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
  10. 10. How we make literature-data connections
  11. 11. Links: • by the author - on submission, as metadata (primary databases) • by database curators - information and links from the literature • expensive, slow, but high quality. Text mining: • by algorithms that use terminologies (can be subject to lag) • post publication – can find new associations • variable quality, but high throughput
  12. 12. Links from Literature to Databases • Proteins • Nucleotides • OMIM • Chemicals • Structure • Clinical reviews • Protein families • Protein-protein interactions • Gene expression experiments • … (figures shown on slide: 800 K, 370 K, 110 K)
  13. 13. Text Mining in UKPMC (2.2 million articles)
      Semantic Type | Unique Terms | Articles | Annotations
      Gene/Protein | 225,905 | 1,288,809 | 15,021,502
      GO Terms | 32,486 | 1,806,539 | 15,016,957
      Organism | 178,847 | 1,689,251 | 12,322,782
      Disease | 170,592 | 1,743,212 | 16,201,198
      Accession No. | 232,950 | 65,640 | 331,329
      Chemical | 76,350 | 1,669,500 | 22,438,980
  14. 14. Case study
  15. 15. 3.9 billion years ago
  16. 16. E. coli meets humans: human colon cancer, DNA repair
  17. 17. 07/21/10
  18. 18. Protein structure in PDBe
  19. 19. Link to the literature from the PDBe record
  20. 20. Algorithms that find similar structures
  21. 21. Text mine full text for 1ewq
  22. 22. Towards understanding DNA repair mechanisms
  23. 23. Challenges and future directions
  24. 24. Data-driven science • Data re-use: biology is post publication • Linking: citing papers and data (provenance and integration) • Metrics and attribution • Hard decisions about value of keeping complete data sets
  25. 25. Data landscape - possibilities analysis Research articles Unstructured Data Structured links Big Data: Deposition Primary Big Data: Curated Annotation reuse?
  26. 26. PDF, HTML, GIF, JPG, TIF, MOV, DOC, XSL. Analysis supplied by Mimas, University of Manchester
  27. 27. Solutions that make sense to scientists
  28. 28. http://ukpmc.ac.uk
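
Slides 11 and 21 describe text mining "by algorithms that use terminologies" and searching full text for the PDB entry 1ewq. The sketch below illustrates that general idea in Python; it is not the UKPMC or Whatizit implementation. The four-entry terminology, the sample sentence, and the function name `annotate` are invented for illustration, and the accession pattern assumes a four-character PDB identifier (a digit followed by three alphanumerics) with at least one letter, so that years such as 2012 are not matched.

```python
import re

# Hypothetical mini-terminology: surface form -> semantic type.
# A production terminology would hold far more entries than this.
TERMINOLOGY = {
    "dna repair": "GO Term",
    "colon cancer": "Disease",
    "escherichia coli": "Organism",
    "e. coli": "Organism",
}

# Assumed PDB identifier shape: one digit (1-9) plus three alphanumerics.
# The lookahead demands at least one letter so that plain years are skipped.
PDB_ID = re.compile(r"\b[1-9](?=[A-Za-z0-9]{0,2}[A-Za-z])[A-Za-z0-9]{3}\b")

# Invented sample text, not taken from any article.
SAMPLE = (
    "We compared our model with PDB entry 1ewq. Defects in DNA repair are "
    "associated with colon cancer, and E. coli is a common model organism."
)


def annotate(text):
    """Return sorted (start, end, matched_text, semantic_type) tuples."""
    hits = []
    # Dictionary lookup: find each terminology entry as a whole phrase.
    for term, semantic_type in TERMINOLOGY.items():
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(), semantic_type))
    # Pattern lookup: find strings shaped like PDB accession numbers.
    for match in PDB_ID.finditer(text):
        hits.append((match.start(), match.end(), match.group(), "Accession No. (PDB)"))
    return sorted(hits)


if __name__ == "__main__":
    for start, end, matched, semantic_type in annotate(SAMPLE):
        print(f"{start:4d}-{end:<4d} {semantic_type:20s} {matched}")
```

The trade-off named on slide 11 is visible even at this scale: the dictionary only finds terms it already contains, which is one way to read the slide's note that terminologies can lag, while the accession pattern finds identifiers it has never seen but can also match four-character strings that are not PDB entries at all.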
