Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Published in: Health & Medicine, Technology
  1. Identifying data sharing in the biomedical literature. Heather Piwowar and Wendy Chapman, Department of Biomedical Informatics, U of Pittsburgh
  2. Our full paper: visualized as a “Wordle” (font size ~ word frequency; location and orientation are random)
  3. Created at IBM’s data sharing and visualization site, Many Eyes
  4. Our aim: identify research articles for which the authors have shared their datasets. For this research: sharing = submitted to centralized databases
  5. Links between article and data are important
  6. The data provides detail for the results of the article
  7. The article provides detail for the data
  8. Specialized searching methods help us find articles OR data... but what about when we want articles WITH data?
  9. How can we find articles that have shared their datasets?
  10. Sometimes the links are easy to discover
  11. (1) Through database citations: when authors upload data to a database, they have the opportunity to cite the paper that describes the data collection
  12. Unfortunately, the citation is often left blank because the data is submitted before the paper is published
  13. (2) Through hyperlink URLs in the text: authors often reference their datasets within their paper with a website URL
  14. But the meaning of the hyperlinks is ambiguous. Sometimes they point to datasets that have been accessed, rather than submitted.
  16. And often the text contains no hyperlinks at all
  17. (3) Through text mining
  18. What if we could extract phrases like “data of the experiment can be accessed at”?
  19. Full-text phrases containing “... accessed”
  20. “can be accessed” suggests data is shared
  21. BUT “was/were accessed” suggests data reuse!
  22. Full-text phrases containing “... downloaded”
  23. “was/were downloaded” suggests data reuse
  24. while “can be downloaded” suggests data sharing
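The contrast on the slides above ("can be accessed/downloaded" versus "was/were accessed/downloaded") can be sketched as two small regular expressions. This is only an illustration under stated assumptions: the cue lists contain just the two verbs shown on the slides, and `classify_cue` is a hypothetical helper name, not part of the authors' system.

```python
import re

# Assumption: only the two example verbs from the slides are covered.
# "can be <verb>" is read as a statement of sharing.
SHARING_CUE = re.compile(r"\bcan\s+be\s+(?:accessed|downloaded)\b", re.IGNORECASE)
# "was/were <verb>" is read as a statement of reuse.
REUSE_CUE = re.compile(r"\b(?:was|were)\s+(?:accessed|downloaded)\b", re.IGNORECASE)


def classify_cue(sentence: str) -> str:
    """Return 'sharing', 'reuse', or 'unknown' for one sentence."""
    if SHARING_CUE.search(sentence):
        return "sharing"
    if REUSE_CUE.search(sentence):
        return "reuse"
    return "unknown"


if __name__ == "__main__":
    print(classify_cue("Data of the experiment can be accessed at GEO."))
    print(classify_cue("Expression profiles were downloaded from GEO."))
```

A real classifier would need many more cues; the point here is only that the auxiliary verb, not the main verb, carries the sharing/reuse distinction.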
  25. Our aim: identify research articles for which the authors have shared their raw datasets. Proposed approach: develop a system to identify statements of shared data from an article’s full text.
  26. Materials: full text from a subset of the open access literature; database submission citations from five databases: GenBank, Protein Data Bank, Gene Expression Omnibus, ArrayExpress, and Stanford Microarray Database
  27. Our gold standard: an article was considered to have a “shared dataset” if the article was cited within the primary submission field of a database entry (+ a small amount of manual screening to find additional positives based on full text)
  28. Approach: for those articles that mention database names, extract a 300-character window around every mention of a database name, then apply various mining algorithms to decide if there is evidence that the authors deposited data from this study in the database
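The window-extraction step described above might look roughly like the following sketch. The slide does not say whether the 300 characters are split evenly around the mention, so this version assumes 150 characters on each side; `extract_windows` and `DATABASE_NAMES` are illustrative names, and matching is case-insensitive by assumption.

```python
import re

# The five databases named in the Materials slide.
DATABASE_NAMES = [
    "GenBank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]


def extract_windows(text, names=DATABASE_NAMES, width=300):
    """Return (matched_name, excerpt) pairs, one per database-name mention.

    Assumption: `width` is split evenly, so each excerpt holds up to
    width // 2 characters before and after the mention itself.
    """
    half = width // 2
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    windows = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - half)
        end = min(len(text), match.end() + half)
        windows.append((match.group(0), text[start:end]))
    return windows


if __name__ == "__main__":
    sample = "Microarray data have been deposited in ArrayExpress under accession E-XXXX-0."
    for name, excerpt in extract_windows(sample):
        print(name, "->", excerpt)
```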
  29. Results: queried 24 000 articles across 27 journals; 25% of all open access articles mentioned one of the database names (50% GenBank); development set of 4434 articles, training set of 2000, test set of 1028
  30. True positives: 23% of the articles that mentioned a database were cited from within a database submission field, which is evidence that the article shared its data!
  31. Three simple methods for identifying sharing. Does the excerpt surrounding the database name contain: (1) the word “accession”, (2) an accession number, (3) a URL
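As a rough sketch, the three simple checks could each be a single pattern applied to the excerpt. The accession-number pattern below is an assumption for illustration: it recognises only GEO-style identifiers (prefixes GSE, GSM, GDS, GPL followed by digits), whereas each of the five databases has its own format. `simple_methods` is a hypothetical helper name.

```python
import re

# Method 1: the word "accession" (also matches "accessions", "accessioned").
ACCESSION_WORD = re.compile(r"\baccession", re.IGNORECASE)
# Method 2: an accession number. Assumption: GEO-style identifiers only.
ACCESSION_NUMBER = re.compile(r"\b(?:GSE|GSM|GDS|GPL)\d+\b")
# Method 3: a URL, loosely defined as http(s):// or www. plus non-space text.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def simple_methods(excerpt):
    """Return one boolean per simple method for a single excerpt."""
    return {
        "accession_word": bool(ACCESSION_WORD.search(excerpt)),
        "accession_number": bool(ACCESSION_NUMBER.search(excerpt)),
        "url": bool(URL.search(excerpt)),
    }


if __name__ == "__main__":
    print(simple_methods("Data are available in GEO under accession GSE1234."))
```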
  32. Two complex methods: (4) a manually-derived regular expression to match lexical cues that suggest sharing; (5) an automatically-derived bag-of-words decision tree
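To make the bag-of-words idea concrete, here is a minimal pure-Python sketch that learns only the root split of a decision tree (a decision stump): it picks the single word whose presence or absence best separates sharing excerpts from non-sharing ones, scored by Gini impurity. This is not the authors' learner; the tokeniser, the impurity measure, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(excerpt):
    """Lower-cased word counts; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", excerpt.lower()))


def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_word(excerpts, labels):
    """Return the word whose presence/absence gives the purest two groups.

    Ties are broken alphabetically because the vocabulary is sorted and
    only a strictly better score replaces the current best.
    """
    bags = [bag_of_words(e) for e in excerpts]
    vocabulary = sorted(set().union(*bags))
    best_word, best_score = None, float("inf")
    for word in vocabulary:
        present = [y for bag, y in zip(bags, labels) if word in bag]
        absent = [y for bag, y in zip(bags, labels) if word not in bag]
        # Size-weighted impurity of the two groups after splitting on `word`.
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(labels)
        if score < best_score:
            best_word, best_score = word, score
    return best_word


if __name__ == "__main__":
    excerpts = [
        "data were deposited in GEO",
        "sequences were deposited in Genbank",
        "data were downloaded from GEO",
        "sequences were downloaded from Genbank",
    ]
    labels = [True, True, False, False]  # True = article shared its data
    print(best_split_word(excerpts, labels))
```

A full decision tree would repeat this split recursively on each group; a library implementation would normally be used in practice.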
  33. Snippet of manually-developed regular expression: (we | have | has | is | are | was | were | be | been) + (accessioned | added | archived | assigned | deposited | entered | imported | included | inserted | loaded | lodged | placed | posted | provided | registered | reported to | stored | submitted | uploaded to)
  34. How accurately were these methods able to identify papers with evidence of public database submissions?
  35. Recall: % of papers cited in database submission fields that were found by our methods
  36. Recall: % of papers cited in database submission fields that were found by our methods. Best method for recall depends on database
  37. Recall: % of papers cited in database submission fields that were found by our methods. “accession” good for some, <url> for others
  38. Recall: % of papers cited in database submission fields that were found by our methods. Lexical regular expressions do well overall
  39. Precision: % of papers found by our methods that were cited in database submission fields
  40. Precision: % of papers found by our methods that were cited in database submission fields. Lexical regular expressions do well overall, bag-of-words does even better
  41. Precision: % of papers found by our methods that were cited in database submission fields. Precision of simple patterns depends on database
  42. Precision: % of papers found by our methods that were cited in database submission fields. Simple patterns do poorly on the most popular databases (those with the most statements of reuse?)
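The two definitions on the slides above translate directly into set arithmetic. The sketch below assumes articles are represented by identifiers (for example PubMed IDs); `precision_recall` is an illustrative helper, and the identifiers in the usage example are made up.

```python
def precision_recall(found, gold):
    """Compute (precision, recall) for a set of articles flagged by a method.

    found: identifiers of articles the method classified as sharing data.
    gold:  identifiers of articles cited in database submission fields.
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    # Precision: share of flagged articles that are in the gold standard.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of gold-standard articles that the method flagged.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    p, r = precision_recall(found={"a", "b", "c", "d"}, gold={"a", "b", "e"})
    print(f"precision={p:.2f} recall={r:.2f}")
```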
  43. Precision vs. recall plot of all methods for each database. Diverse!
  44. Relative strength of methods for this task across databases (figure labels: bag of words, <lexical patterns>, <accession>, <url>, “accession”)
  45. Limitations: bias due to manual screening of negatives; database-centric classifier; approach requires computational access to literature full text!
  46. Impact: a recent version that runs in PubMed Central could increase GEO article links by 2.6%, or by 5.5% annually once all NIH-funded articles are in PMC; doubling the recall (to 80%) would double these estimates; 40 links already added by GEO staff!
  47. Ongoing work: (1) continue focusing on methods that use existing full-text query interfaces, like PubMed Central; (2) use this tool to evaluate the patterns and prevalence of biomedical research data sharing and reuse
  48. Thanks to the Dept of Biomedical Informatics at the U of Pittsburgh, the NLM for funding through training grant 5 T15 LM007059-22, and everyone who publishes “gold” open access, thereby facilitating reuse of article full text for studies like this. My shared data: www.dbmi.pitt.edu/piwowar Share your research data too!
  49. Our manual filter for additional positive classifications identified more cases in some databases than others: we reclassified 19% of [article, database] cases from ArrayExpress as positive despite an omitted literature link, compared to 11%, 7%, 2%, and 1% for GEO, GenBank, PDB, and SMD respectively (see Table 2 for raw number of cases). The most common situations included: the database entry listed a citation for another paper by the same authors, the entry listed an erroneous PubMed ID, the entry included a citation without a PubMed ID, or the entry had a blank citation field.
  50. Usage? Scientists looking for datasets for reuse; curators looking for primary citations; researchers studying data sharing behaviour
  51. Regular expression: the precise one, plus r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))", r"(\b(raw|original|our|complete|detailed).{0,20}data)", r"(\b(we|have|is|was|were|is|are|be|have|has|been).(exported|gave|given|listed|provided|reported))" ]) + ")"
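The trailing `]) + ")"` on this slide suggests the fragments are held in a list and joined into one alternation. A minimal sketch of that idea follows, assuming that the leading `b` in each fragment is a word-boundary `\b` whose backslash was lost in transcription, that the join character is `|`, and that matching is case-insensitive. The "precise" expression the slide builds on is not reproduced here, and `has_sharing_cue` is a hypothetical helper name.

```python
import re

# The three fragments from the slide, with `\b` restored (an assumption).
FRAGMENTS = [
    r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",
    r"(\b(raw|original|our|complete|detailed).{0,20}data)",
    r"(\b(we|have|is|was|were|is|are|be|have|has|been).(exported|gave|given|listed|provided|reported))",
]

# Join the fragments into a single alternation, wrapped in one outer group.
COMBINED = re.compile("(" + "|".join(FRAGMENTS) + ")", re.IGNORECASE)


def has_sharing_cue(excerpt):
    """True if any of the lexical fragments matches the excerpt."""
    return COMBINED.search(excerpt) is not None


if __name__ == "__main__":
    print(has_sharing_cue("The raw data are in ArrayExpress."))
    print(has_sharing_cue("Sequences came from Genbank."))
```

Bounded gaps such as `.{0,20}` keep each cue local to a phrase, which matters when the pattern is run over a fixed-width excerpt rather than a single sentence.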
  52. Precise regular expression (fragments): (we|have|has|is|are|was|were|be|been) followed by (accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to); (is|are|will.be|made).{0,20}(available|accessible); (be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed); (through|under|as).{0,20}accession; (given|new|received|assigned).{0,20}(accession); (data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.
  53. Stopwords are important!
  54. Recall
  55. Precision
  56. Evaluation: queried 24 000 articles across 27 journals; 25% mentioned one of the database names; development set of 4434, training set of 2000, test set of 1028
  57. Research data. http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  59. Research data. “PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, …” http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441