Linking database submissions to primary citations with PubMed Central
Heather A. Piwowar and Wendy W. Chapman
Department of Biomedical Informatics, University of Pittsburgh

Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).

Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.

Introduction

The expected deluge of full-text biomedical research articles into PubMed Central (PMC), as mandated by recent NIH policy [1], creates many opportunities for improving research tools and processes. Most biomedical text mining and natural language processing (NLP) has been limited to titles and abstracts: these are available in abundance in PubMed. Analysis of machine-readable full text would permit a much deeper and wider scope of study, but assembling a corpus has been hindered by the complex, disparate, and decentralized access processes and licenses of publisher websites. While PMC does not permit automated downloading of non-Open Access full text (as per publisher licenses), full text can be queried from the PMC interface. Integrating the ability to query the full text of all future NIH-funded research reports in combination with MeSH terms and other NCBI Entrez database links offers exciting possibilities.

In this report, we explore the potential of one such application: linking articles that describe data collection to their database submission entries. Databases that store research datasets often include citation links to the articles that describe the initial generation and use of the datasets. As we discuss below, these links are valuable, often missing, and time-consuming to manually derive. We previously developed several NLP systems to identify declarations of database submission within research articles [2]; however, these systems required access to complete full text for feature extraction. To take advantage of the PMC resource, here we develop a system restricted to rules that can be expressed within the PMC query interface.

We apply our system to gene expression microarray studies deposited in NCBI's Gene Expression Omnibus (GEO) database [3]. Gene expression data are expensive to collect, often but not always shared, and valuable for reuse. The GEO database is the largest repository for gene expression datasets, is well integrated with PMC query results, and contains links from submitted datasets to primary citation reports.

Methods

Our goal was to develop a PMC query for retrieving articles that mention depositing a dataset into GEO. We developed the query using a selection of Open Access (OA) articles, and evaluated it on non-OA articles.

We used a gold standard based on our previous work [2]. Positive cases came from two sources:
all OA articles that were linked from the GEO DataSet primary submission field, plus articles without a primary citation link from the GEO database that were nonetheless judged to have deposited data into GEO. Manual judgment for the selected OA articles was based on reviewing the full-text reports. Negative cases were considered those articles that were not linked from GEO Datasets and were manually classified as lacking any indication within their full text that they had deposited a dataset into GEO.

We used NCBI's Entrez E-Utilities, PubMed Central, Python, TagHelper Tools [4], and Weka [5] to remove rare words (<40 occurrences) and stopwords, create unigram bag-of-word vectors, automatically select features, and build tree (J48) and rule (PART) machine learning classifiers for a variety of parameter values. We manually derived a PMC-compatible query based on the most robust feature selection and classifier results.

Recall was calculated by determining what percentage of the non-OA (since OA was used in training) PMC articles with links to Gene Expression Datasets were found by the query. We evaluated applicable precision by manually reviewing the non-OA query hits for articles that are not currently linked to GEO datasets and determining whether they indeed included statements of dataset submission to GEO.

Finally, we compared the current count of NIH-funded, GEO-linked articles in PubMed to those currently within PMC to project the possible impact our query might have once all NIH-funded datasets are deposited in PMC.

Results

The training set was composed of open-access articles, including 550 positive examples (articles that had links from the GEO primary citation fields or were manually determined to have shared data in GEO) and 165 negatives (articles without links from GEO). We combined the rules and tree branches that occurred most frequently across the trained machine learning classifiers to compose the following PubMed Central query:

    (geo OR omnibus)
    AND microarray
    AND "gene expression"
    AND accession
    NOT (databases
         OR user OR users
         OR (public AND accessed)
         OR (downloaded AND published))

This query retrieved 772 articles, of which 455 were not open access. The results included 385 of the 966 PubMed Central non-OA articles with links to the GEO Datasets ("pmc gds"[filter] NOT "open access"[filter]), for a recall of 40%.

Next, we limited the query to non-OA articles without a PMC link to the GEO Datasets. We manually determined that 44 of the 68 results included a statement of dataset submission to GEO within their full-text report. This indicates an overall query precision of 94% ((385+44)/455) for retrieving articles that have deposited datasets into GEO and an applicable precision of 65% (44/68) for retrieving articles that don't have PMC links but should. Our error analysis of the 24 false positives found that 13 of the articles referenced GEO datasets in the context of dataset reuse rather than submission (including 2 reusing their own work), 4 referenced GEO in the context of platform descriptions rather than datasets, and 5 didn't reference the GEO database at all but rather mentioned the word "geo" for another purpose, usually the beta-geo gene.

The PubMed database contains 4291 articles with links from GEO DataSets (in PubMed: "pubmed gds"[filter]). Thus, the addition of an estimated 115 (177*65%) novel true positive links would increase the current number of dataset-submission-to-primary-citation links by about 2.6%.

We also estimated how the query impact might increase once new NIH-funded articles are deposited in PMC. PMC contains 202 articles published in 2007, funded by the NIH, and linked from GEO DataSets. In comparison, the PubMed database contains 596 such articles, almost three times as many. Our query returned 39 hits for NIH-funded articles published in 2007 that were not linked from GEO DataSets. If all NIH-funded articles were in PMC, and if similar patterns exist for microarray papers that share their data but do not currently have links from
the GEO database, we estimate our query could return roughly 117 (39*3) new articles per year identifying data sharing not included in primary citations, of which 76 (117*65%) might be true positives. This would increase the annual count of primary citation links by about 5.5% (1310 NIH and non-NIH PubMed articles with GEO Database links in 2007 + 76 projected additions).

The trivial query, "gene expression omnibus" AND (submitted OR deposited), resulted in a 34% recall and 90% overall precision.

Discussion

Database submissions often include a link to the research article that describes the original data collection conditions and interpretations. Our results suggest a simple query on full-text can automatically identify database submission primary citation links with a precision of 94% and recall of 40%. A trivial full-text query identified articles with 90% precision and 34% recall. Precision for the subset of articles without existing links from the GEO database was 65%. The methods we describe can be used to develop queries for identifying primary citations across a wide variety of datatypes and databases.

The approach outlined in this study is much more practical than a complex regular-expression classifier running on article full text. Processing full-text articles requires not only access licenses and reuse permissions (or a limitation to open access content) but also the maintenance of a text repository and classification system. Querying full text through PubMed Central, in contrast, is publicly available, requires no infrastructure beyond an internet connection, and covers all OA and non-OA articles within PMC.

We imagine this query could be used in two ways. It could be used by dataset-seekers, by appending it onto PubMed or PMC queries to find articles with shared datasets. Alternatively, it could be used by biocurators as a tool for identifying primary citations that may be missing from their database submission fields. This latter use has broad implications, which we discuss further below.

Links between shared datasets and primary citations have many purposes. First, the citation serves as rich documentation for the dataset, whether as free text or as meta-data mark-up as illustrated by the BioLit PDB Clone ( Second, the citation provides a crucial mechanism for attributing recognition to the originators of the dataset upon reuse [6]. Third, the citation provides a link for enhanced information retrieval or text/data integration pathways [7,8].

Unfortunately, links to primary citations are often missing from database submission entries because datasets are usually submitted before publication details are known [9]. Evidence suggests that a significant number of links from database submissions to the primary literature may be missing. For example, the PDB data uniformity project of 2000 found that 33% of submission entries lacked a citation. Half of these were recovered automatically using the list of submitter names, 40% through manual searches of PubMed and the Thomson ISI databases, and 10% (3% of total) were presumed to represent work that was never published [10]. More recently, another large-scale PDB remediation project looked at improving the quality of many fields, including primary citations. As of May 2005, 8508 (27%) of the 31663 database submissions required remediation due to inconsistent or missing PubMed IDs and citation information. A report near the end of the remediation process [11] estimated that manual searches found PubMed IDs for 1226 entries, citation information without PubMed IDs for 387 entries, and about 700 (2% of 31663) were presumed unpublished. These examples suggest that a sizeable number of entries may be missing citation fields, and that most of them are recoverable. Unfortunately, these efforts are time-intensive and thus difficult to incorporate into the workflow of busy biocurators [12]. NLP is already being used to aid database curation in a variety of tasks [13], and we believe it can also help biocurators identify missing links to primary citations.

Procedurally, our query results could be manually confirmed and then used to update database records. GEO asks for omitted citations ( html); we have sent them our findings and they have updated their database to include the missing links identified in this study. Other databases, however, consider the submission record the property of the submitter [14] and are
thus unlikely to add citations without permission. Perhaps in these situations an automated system could be developed to email submitters requesting they add or permit the addition of the citation.

The performance of our query could undoubtedly be improved through systematic refinement [15]. Future work could involve deriving additional cues through bootstrapping and semi-supervised learning, including stemmed words with wildcards, and refining the query based on error analysis (for example, excluding hits on beta-geo). Additional improvements could be achieved if the PMC query capabilities were enhanced. For this application, it would be particularly useful to remove negation and modal verbs from the stop word list ( highlight=stopwords& medhelp.T43 ) so they could be included within query n-grams.

Linking shared datasets to primary citations increases their value: the datasets become easier to find, easier to understand, easier to responsibly acknowledge, and easier to integrate with other information. Dataset deposits are growing exponentially. As NIH-funded research makes its way to PMC, the opportunity for creating links between datasets and full-text articles increases enormously. We hope this study provides a useful preliminary tool and inspires further research in this area.

Our manual annotation results are available at .

Funding

National Library of Medicine (5T15-LM007059-19 to HAP, 1R01-LM009427-01 to WWC)

References

1. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research.
2. Piwowar, H.A. & Chapman, W.W. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings< npre.2008.1721.1> (2008).
3. Barrett, T., et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35 (2007).
4. Rose, C.P., et al. Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning. International Journal of Computer Supported Collaborative Learning (In Press).
5. Witten, I.H. & Frank, E. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco (2005).
6. Compete, collaborate, compel. Nat Genet 39 (2007).
7. Butte, A.J. & Chen, R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc, 106-110 (2006).
8. Muller, H.M., Kenny, E.E. & Sternberg, P.W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2 (2004).
9. Piwowar, H.A. & Chapman, W.W. A review of journal policies for sharing research data. Available from Nature Precedings< npre.2008.1700.1>, (2008).
10. Bhat, T.N., et al. The PDB data uniformity project. Nucleic Acids Res 29, 214-218 (2001).
11. PDBj News Letter. in Volume 7, March 2006< ewsletter_vol7_e.pdf> (2006).
12. Burkhardt, K., Schneider, B. & Ory, J. A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology 2 (2006).
13. Karamanis, N., et al. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics 9 (2008).
14. Pennisi, E. DNA DATA: Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science 319, 1598-1599 (2008).
15. Zhang, L., Ajiferuke, I. & Sampson, M. Optimizing search strategies to identify randomized controlled trials in MEDLINE. BMC Medical Research Methodology 6, 23 (2006).
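
Worked example. The query and the evaluation arithmetic reported in the Results section can be rechecked with a short script. The Python sketch below is illustrative and not part of the original study. It assumes the query text and the counts exactly as printed above, and it assumes the query would be submitted through the NCBI E-utilities `esearch` endpoint with `db=pmc`. The function names (`restrict_to_unlinked_non_oa`, `esearch_url`, `ratio`) are hypothetical helpers introduced here. The script only builds the request URL and does not contact NCBI; whether the filter syntax still behaves as it did at the time of the study is not verified.

```python
"""Sketch: rebuild the reported PMC query and recheck the reported figures."""
from urllib.parse import urlencode

# Query text as printed in the Results section.
GEO_QUERY = (
    '(geo OR omnibus) AND microarray AND "gene expression" AND accession '
    "NOT (databases OR user OR users OR (public AND accessed) "
    "OR (downloaded AND published))"
)

# NCBI E-utilities search endpoint (assumed entry point for this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def restrict_to_unlinked_non_oa(term: str) -> str:
    """Limit a query to non-OA articles that have no GEO DataSets link."""
    return f'({term}) NOT "open access"[filter] NOT "pmc gds"[filter]'


def esearch_url(term: str, db: str = "pmc", retmax: int = 0) -> str:
    """Build, but do not send, an esearch request URL for the given term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting non-positive denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Counts as reported in the text.
recall = ratio(385, 966)                  # linked non-OA articles found by the query
overall_precision = ratio(385 + 44, 455)  # linked hits plus confirmed unlinked hits
applicable_precision = ratio(44, 68)      # confirmed hits among unlinked non-OA hits

# Projection for 2007: 39 hits scaled by the rounded factor of 3 used in the
# text (596 PubMed articles versus 202 PMC articles), then by 65% precision.
projected_hits = 39 * 3
projected_true_positives = round(projected_hits * 0.65)

if __name__ == "__main__":
    print(esearch_url(restrict_to_unlinked_non_oa(GEO_QUERY)))
    print(f"recall: {recall:.1%}")
    print(f"overall precision: {overall_precision:.1%}")
    print(f"applicable precision: {applicable_precision:.1%}")
    print(f"projected hits: {projected_hits}, "
          f"projected true positives: {projected_true_positives}")
```

Keeping the URL construction separate from any network call lets the query string be inspected or pasted into the PMC search box by hand before anything is sent to NCBI.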