Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

OpenAIREplus workshop - “Linking Open Access publications to data – policy development and implementation” (June 11, 2012)


Published in: Technology, Health & Medicine

Transcript

  1. Linking literature to data in the life sciences OpenAIREplus workshop, Copenhagen, 11 June 2012
  2. Overview • What literature? What data? • How we make literature-data connections • Case study • Challenges and future directions
  3. What literature? What data?
  4. Data Landscape and Definitions [diagram labels: Research articles; Funder mandates; Journal requirements; Metadata Standards; Big Data: Deposition; Primary; Unstructured Data; *reuse; Big Data: Curated; Annotation]
  5. PMC336623 Extended to several other biological data types
  6. [Figure: six panels plotted by year, 2001-2010: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures)] • Big data • Thematic data • Public data • Archived data • Two petabytes of data • Scales to 7 petabytes raw disk • Majority is DNA
  7. Two core literature databases
     Abstracts:
     • 26 million abstracts (PubMed, Patents, Agricola)
     • Website and web services: citation networks, database links, Whatizit text mining
     • Over 1.1 million new records per year
     Full text:
     • 2.2 million full text articles (217K articles with supplementary data)
     • Website: supplemented by CiteXplore, additional text mining
     • Over 150K new articles per year
  8. UK PubMed Central Overview • Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006 • Led by the European Bioinformatics Institute since 2011, with the British Library, and the University of Manchester • Supported by 16 UK and 2 European Funders, led by the Wellcome Trust. Research spend: ~ 2 billion GBP • A life-science web-based repository • Manuscript submission service (self archiving by grant holders) • Database of grant information – with details of about 18000 PIs • Grant reporting and funder analysis tool • 250K requests, 40K IPs, 7K direct interactive searches per day
  9. How many articles? Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
  10. How we make literature-data connections
  11. Links
      • by the author - on submission, as metadata (primary databases)
      • by database curators - information and links from the literature
      • expensive, slow, but high quality
      Text mining
      • by algorithms that use terminologies (can be subject to lag)
      • post publication – can find new associations
      • variable quality, but high throughput
  12. Links from Literature to Databases (figures shown: 800 K, 370 K, 110 K) • Proteins • Nucleotides • OMIM • Chemicals • Structure • Clinical reviews • Protein families • Protein-protein interactions • Gene expression experiments …
  13. Text Mining in UKPMC (2.2 million articles)
      Semantic Type   Unique Terms   Articles    Annotations
      Gene/Protein    225,905        1,288,809   15,021,502
      GO Terms        32,486         1,806,539   15,016,957
      Organism        178,847        1,689,251   12,322,782
      Disease         170,592        1,743,212   16,201,198
      Accession No.   232,950        65,640      331,329
      Chemical        76,350         1,669,500   22,438,980
  14. Case study
  15. 3.9 billion years ago
  16. E. coli meets humans • Human colon cancer • DNA repair
  18. Protein structure in PDBe
  19. Link to the literature from the PDBe record
  20. Algorithms that find similar structures
  21. Text mine full text for 1ewq
  22. Towards understanding DNA repair mechanisms
  23. Challenges and future directions
  24. Data-driven science • Data re-use: biology is post publication • Linking: citing papers and data (provenance and integration) • Metrics and attribution • Hard decisions about value of keeping complete data sets
  25. Data landscape - possibilities [diagram labels: analysis; Research articles; Unstructured Data; Structured links; Big Data: Deposition; Primary; Big Data: Curated; Annotation; reuse?]
  26. [Chart of file types: PDF, HTML, GIF, JPG, TIF, MOV, DOC, XSL] Analysis supplied by Mimas, University of Manchester
  27. Solutions that make sense to scientists
  28. http://ukpmc.ac.uk
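
Slides 11, 13 and 21 describe text mining full text for database accession numbers such as 1ewq, using terminologies. The Python sketch below illustrates that general idea only; it is not the UKPMC or Whatizit pipeline. The function name `find_accessions`, the identifier pattern (one non-zero digit followed by three alphanumeric characters), the sample sentence and the placeholder identifier 2abc are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a PDB-style identifier is one non-zero digit followed by
# three letters or digits, e.g. "1ewq". This pattern alone also matches
# ordinary numbers such as years, so candidates are checked against a
# terminology (a set of known identifiers) before being counted.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")


def find_accessions(text, known_ids):
    """Count mentions in `text` of identifiers listed in `known_ids`.

    `known_ids` is a set of lower-case identifiers standing in for the
    terminology that a real annotation service would load from the database.
    """
    hits = Counter()
    for match in PDB_ID.finditer(text):
        candidate = match.group(0).lower()
        if candidate in known_ids:  # terminology lookup filters false positives
            hits[candidate] += 1
    return hits


# Hypothetical sentence; "2010" and "2abc" match the pattern but are
# rejected because they are not in the terminology.
sample = "Deposited as 1ewq. The 1EWQ entry was published in 2010, unlike 2abc."
print(dict(find_accessions(sample, {"1ewq"})))  # {'1ewq': 2}
```

The lookup step reflects the trade-off noted on slide 11: pattern matching is high throughput but of variable quality, and a terminology that lags behind the database will miss newly issued identifiers.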
