Literature-Data Integration in the Life Sciences. Lisbon, Oct 2nd 2012

Access to open data through open access articles in the life sciences


Access to open data through open access articles in the life sciences. Johanna McEntyre (EMBL-European Bioinformatics Institute)



  1. Literature-Data Integration in the Life Sciences. Lisbon, Oct 2nd 2012
  2. Publications and Data Sources
  3. Europe PubMed Central: 26 million abstracts; 2.3 million full-text articles; citation networks; database links; text mining. Years shown: 2006, 2011, 2012, 2016?
  4. How many open access articles in UKPMC?
     [Chart by publication date: PubMed (995K); UKPMC (18%, 182K); OA (9.6%, 96K).]
     Total: 489,000 OA articles
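The percentages on this slide can be checked with simple arithmetic. This is a minimal sketch, assuming that 995K, 182K and 96K are article counts for the same period and that both percentages are shares of the PubMed figure; the variable and function names are illustrative only.

```python
# Counts as read from the slide (assumed to be article counts for one period).
pubmed_total = 995_000
ukpmc_full_text = 182_000
open_access = 96_000


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


ukpmc_share = share(ukpmc_full_text, pubmed_total)
oa_share = share(open_access, pubmed_total)

print(f"UKPMC full text: {ukpmc_share}% of PubMed")
print(f"Open access:     {oa_share}% of PubMed")
```

Under those assumptions the computed shares agree with the rounded 18% and 9.6% given on the slide.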
  5. • Big data
     • Thematic data
     • Public data
     • Archived data
     • Two petabytes of data
     • Scales to 7 PB raw disk
     • Majority is DNA
     [Figure 2. Growth of key resources, by year, 2001 to 2010. Panels: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures).]
  6. Literature citation from data vs. data referral from literature
  7. PMC336623. Extended to several other biological data types
  8. Literature citation from data
     • Proteins
     • Nucleotides
     • OMIM
     • Chemicals
     • Structure
     • Clinical reviews
     • Protein families
     • Protein-protein interactions
     • Gene expression experiments
     Figures shown on the slide: 800 K, 370 K, 110 K
  9. Data referral from literature: text mining

     Semantic Type    Unique Terms    Articles     Annotations
     Accession No.    233,017         66,356       387,787
     Chemical         76,712          1,694,385    83,923,066
     Disease          171,692         1,768,214    57,821,871
     Gene/Protein     227,318         1,310,382    77,189,022
     GO Terms         32,664          1,832,294    65,061,579
     Organism         180,637         1,713,280    70,832,222

     2.3 million articles
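The table's counts become easier to compare once they are normalised. The sketch below is not part of the original talk: it derives annotations per annotated article and the share of the corpus covered by each semantic type, assuming the "2.3 million articles" figure is the corpus the Articles column was drawn from. The dictionary layout and function name are illustrative.

```python
# semantic type -> (unique terms, articles, annotations), copied from the slide.
TEXT_MINING_STATS = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}
TOTAL_ARTICLES = 2_300_000  # "2.3 million articles"; assumed to be the corpus size


def summarise(stats, total_articles):
    """Compute per-type annotation density and corpus coverage."""
    summary = {}
    for semantic_type, (_terms, articles, annotations) in stats.items():
        summary[semantic_type] = {
            "annotations_per_article": annotations / articles,
            "article_coverage_pct": 100 * articles / total_articles,
        }
    return summary


summary = summarise(TEXT_MINING_STATS, TOTAL_ARTICLES)
for semantic_type, row in summary.items():
    print(
        f"{semantic_type:<14}"
        f"{row['annotations_per_article']:8.1f} annotations/article"
        f"{row['article_coverage_pct']:8.1f}% of articles"
    )
```

On these numbers, accession numbers appear in far fewer articles than any other semantic type, which is the contrast the slide draws.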
  10. Annotation of accession numbers (OA)
      [Chart, axes 0 to 100: publisher-annotated vs. text-mined, for ~10,000 articles and >25,000 articles.]
      BMC Genomics: 1,484 TM tagged, 4,337 articles (1,135 tagged)
      PLoS One: 4,226 TM tagged, 42,888 articles
      Senay Kafkas and Jee-Hyub Kim
  11. Why is this important? Implications
      Scientific:
      • Linking articles that cite the same data
      Citation:
      • Data citation as a measure of impact (Thomson: Data Citation Index)
      • Context of data citation: submission, reuse, analysis
      Operational:
      • Services for publishers to improve accession number tagging
      • Editorial policies and adherence
      • Extension of NLM DTD
      • Lessons learned for considering unstructured data
      That we can perform this analysis at all highlights a benefit of Open Access
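"Linking articles that cite the same data" can be pictured as an inverted index from accession number to articles. The sketch below is an illustration under stated assumptions, not the Europe PubMed Central implementation: the input is a hypothetical mapping from article identifier to the accession numbers found in it, and every identifier in the example is a placeholder.

```python
from collections import defaultdict
from itertools import combinations


def link_articles_by_accession(article_accessions):
    """Map each pair of articles to the set of accession numbers they share."""
    # Invert the mapping: accession number -> articles that mention it.
    by_accession = defaultdict(set)
    for article_id, accessions in article_accessions.items():
        for accession in accessions:
            by_accession[accession].add(article_id)

    # Every pair of articles under the same accession number is linked.
    links = defaultdict(set)
    for accession, articles in by_accession.items():
        for first, second in combinations(sorted(articles), 2):
            links[(first, second)].add(accession)
    return dict(links)


# Hypothetical placeholder identifiers, not real articles or accession numbers.
example = {
    "article_A": {"ACC_1", "ACC_2"},
    "article_B": {"ACC_2"},
    "article_C": {"ACC_2", "ACC_3"},
    "article_D": {"ACC_4"},
}
links = link_articles_by_accession(example)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

Articles A, B and C are linked through the shared `ACC_2`, while article D, which cites only its own dataset, stays unlinked.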
  12. Case Study of an FP7-funded article (1)
  13. Case Study of an FP7-funded article (2)
  14. Europe PubMed Central content map: Abstract; Full text; Citing articles; Unstructured; Datasets; Databases; Extracted terms; Citing articles
  15. AY387398: needle in a haystack
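Finding an identifier such as AY387398 in free text is a pattern-matching problem. The sketch below uses a deliberately simplified regular expression for classic nucleotide accession numbers (one letter plus five digits, or two letters plus six digits, with an optional version suffix). It is an assumption for illustration, not the text-mining pipeline described in the talk, and a real system would need context checks to reject look-alike strings such as grant or catalogue numbers.

```python
import re

# Simplified pattern (an assumption, not the talk's method): one uppercase
# letter + five digits, or two uppercase letters + six digits, optionally
# followed by a ".version" suffix. Word boundaries stop it from matching
# inside longer tokens.
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")


def find_accessions(text: str) -> list[str]:
    """Return the distinct accession-like strings in `text`, sorted."""
    return sorted(set(ACCESSION_RE.findall(text)))


sentence = (
    "The sequence was deposited under accession number AY387398 "
    "and is discussed in PMC336623."
)
print(find_accessions(sentence))
```

The article identifier PMC336623 is not picked up, because its three-letter prefix does not fit either alternative and the word boundary prevents a partial match.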
  16. Europe PubMed Central and Institutional Repositories: content matching
      Number of article IDs
      OpenAIRE plus
      Coming soon: RESTful interface for data linked to articles
  17. People
      • Paula Buttery
      • Andrew Caines
      • Norman Cobley
      • Yuci Gou
      • Senay Kafkas
      • Jyothi Katuri
      • Oliver Kilian
      • Jee-Hyub Kim
      • Nikos Marinos
      • Jo McEntyre
      • Xingjun Pi
      • Philip Rossiter
      • Rebholz Group
      • Peter Stoehr
      • University of Manchester
      • British Library
      • OpenAIRE/OpenAIRE Plus
      • NCBI, NLM
