Linking Database Submissions to Primary Citations with PubMed Central Heather Piwowar and Wendy Chapman Department of Biom...
 
 
<ul><li>These links are important for several reasons </li></ul>
 
 
 
<ul><li>Sometimes the links are easy to discover </li></ul>
 
 
 
 
But the meaning of hyperlinks is ambiguous:
And often no hyperlinks at all:
<ul><li>One way to identify links: </li></ul><ul><li>NLP systems that identify  statements of shared data  from within ful...
<ul><li>BUT this requires developing and maintaining a full-text archive! </li></ul>
<ul><li>What about using PubMed Central? </li></ul>
<ul><li>Usage? </li></ul><ul><li>scientists looking for datasets for reuse </li></ul><ul><li>curators looking for primary ...
<ul><li>Goal: </li></ul><ul><li>Use the simple, full-text query interface of  PubMed Central  to identify articles with sh...
<ul><li>Method: </li></ul><ul><li>Gene expression microarray data </li></ul><ul><li>GEO database </li></ul>
<ul><li>Method: </li></ul><ul><li>Open Access articles to train </li></ul><ul><li>Non-Open access articles to test </li></...
<ul><li>Gold Standard: </li></ul><ul><li>True positives (N=550) Articles with primary citation links from GEO + screening ...
<ul><li>Building the query: </li></ul><ul><li>Used full-text of open-access cohort </li></ul><ul><li>Removed words <40 occ...
<ul><li>(geo OR omnibus)  AND microarray  AND &quot;gene expression&quot;  AND accession NOT (databases    OR user OR user...
<ul><li>(geo OR omnibus)  AND microarray  AND &quot;gene expression&quot;  AND accession NOT (databases    OR user OR user...
<ul><li>(geo OR omnibus)  AND microarray  AND &quot;gene expression&quot;  AND accession NOT (databases    OR user OR user...
<ul><li>(geo OR omnibus)  AND microarray  AND &quot;gene expression&quot;  AND accession NOT (databases    OR user OR user...
<ul><li>(geo OR omnibus)  AND microarray  AND &quot;gene expression&quot;  AND accession NOT (databases    OR user OR user...
<ul><li>Evaluation Results </li></ul><ul><li>40% recall </li></ul><ul><li>94% precision,  65% for those not yet linked </l...
<ul><li>Limitations </li></ul><ul><li>only one datatype </li></ul><ul><li>database-centric </li></ul><ul><li>performance s...
<ul><li>Impact? </li></ul><ul><li>Today’s performance: </li></ul><ul><ul><li>would increase GEO links by 2.6% </li></ul></...
<ul><li>We hope this work </li></ul><ul><ul><li>inspires future enhancements, and </li></ul></ul><ul><ul><li>highlights th...
Thank you <ul><li>Advisor:  Dr. Wendy Chapman </li></ul><ul><li>Funders:  NLM and Pitt DBMI </li></ul><ul><li>Enablers:  E...
Upcoming SlideShare
Loading in …5
×

BIOLINK 2008: Linking database submissions to primary citations with PubMed Central

725 views

Published on

Abstract: Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central. Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central. Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision). Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.

Published in: Health & Medicine, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
725
On SlideShare
0
From Embeds
0
Number of Embeds
8
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

BIOLINK 2008: Linking database submissions to primary citations with PubMed Central

  1. 1. Linking Database Submissions to Primary Citations with PubMed Central Heather Piwowar and Wendy Chapman Department of Biomedical Informatics University of Pittsburgh BioLINK 2008
  2. 4. <ul><li>These links are important for several reasons </li></ul>
  3. 8. <ul><li>Sometimes the links are easy to discover </li></ul>
  4. 13. But the meaning of hyperlinks is ambiguous:
  5. 14. And often no hyperlinks at all:
  6. 15. <ul><li>One way to identify links: </li></ul><ul><li>NLP systems that identify statements of shared data from within full text. </li></ul>
  7. 16. <ul><li>BUT this requires developing and maintaining a full-text archive! </li></ul>
  8. 17. <ul><li>What about using PubMed Central? </li></ul>
  9. 18. <ul><li>Usage? </li></ul><ul><li>scientists looking for datasets for reuse </li></ul><ul><li>curators looking for primary citations </li></ul><ul><li>researchers studying data sharing behaviour </li></ul>
  10. 19. <ul><li>Goal: </li></ul><ul><li>Use the simple, full-text query interface of PubMed Central to identify articles with shared datasets </li></ul>
  11. 20. <ul><li>Method: </li></ul><ul><li>Gene expression microarray data </li></ul><ul><li>GEO database </li></ul>
  12. 21. <ul><li>Method: </li></ul><ul><li>Open Access articles to train </li></ul><ul><li>Non-Open access articles to test </li></ul><ul><li>Gene-expression articles selected by MeSH term query </li></ul>
  13. 22. <ul><li>Gold Standard: </li></ul><ul><li>True positives (N=550) Articles with primary citation links from GEO + screening of full-text </li></ul><ul><li>True negatives (N=165) The rest </li></ul>
  14. 23. <ul><li>Building the query: </li></ul><ul><li>Used full-text of open-access cohort </li></ul><ul><li>Removed words <40 occurrences </li></ul><ul><li>Unigram bag-of-words vectors </li></ul><ul><li>Tree and Rule algorithms, a variety of parameters </li></ul>
  15. 24. <ul><li>(geo OR omnibus) AND microarray AND &quot;gene expression&quot; AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published)) </li></ul>
  16. 25. <ul><li>(geo OR omnibus) AND microarray AND &quot;gene expression&quot; AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published)) </li></ul>
  17. 26. <ul><li>(geo OR omnibus) AND microarray AND &quot;gene expression&quot; AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published)) </li></ul>
  18. 27. <ul><li>(geo OR omnibus) AND microarray AND &quot;gene expression&quot; AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published)) </li></ul>
  19. 28. <ul><li>(geo OR omnibus) AND microarray AND &quot;gene expression&quot; AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published)) </li></ul>
  20. 29. <ul><li>Evaluation Results </li></ul><ul><li>40% recall </li></ul><ul><li>94% precision, 65% for those not yet linked </li></ul><ul><li>worse than full-NLP results (~ 89%,83%) </li></ul><ul><li>slightly better than trivial query (34%,90%) </li></ul>
  21. 30. <ul><li>Limitations </li></ul><ul><li>only one datatype </li></ul><ul><li>database-centric </li></ul><ul><li>performance so far is rather mediocre… </li></ul>
  22. 31. <ul><li>Impact? </li></ul><ul><li>Today’s performance: </li></ul><ul><ul><li>would increase GEO links by 2.6% </li></ul></ul><ul><ul><li>by 5.5% annually when all NIH in PMC </li></ul></ul><ul><li>Double the recall, to 80%: </li></ul><ul><ul><li>double the numbers above </li></ul></ul><ul><li>GEO curators added the 40 links identified by this study </li></ul>
  23. 32. <ul><li>We hope this work </li></ul><ul><ul><li>inspires future enhancements, and </li></ul></ul><ul><ul><li>highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports. </li></ul></ul>
  24. 33. Thank you <ul><li>Advisor: Dr. Wendy Chapman </li></ul><ul><li>Funders: NLM and Pitt DBMI </li></ul><ul><li>Enablers: Everyone who deposits their publications in PubMed Central! </li></ul><ul><li>My shared data: www.dbmi.pitt.edu/piwowar </li></ul><ul><li>Share your research data too! </li></ul>

×