Measuring progress
toward a cultural norm of
  shared (and reused!)
biomedical research data
          Heather Piwowar

  ...
Sharing research data




http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki...
Sharing research data




         PAST MEDICAL HISTORY:
         Past medical history showed she had
         superficial...
Shared data benefits science
 Verify
 Understand
 Extend
 Explore
 Combine
 Synergize
 Train
 Reduce
But... costly for authors
    Find
    Organize
    Document
    Deidentify
    Format
    Decide
    Ask
    Submit

    ...
As a result, policy makers have spent 
 lots of time and money ....




                      http://www.flickr.com/photos...
... on initiatives, requests, 
  requirements, and tools
     NIH data sharing plan requirement

     Journal requirements...
http://www.flickr.com/photos/mesh/14102209/
lots of data sharing!




                        http://www.genome.jp/en/db_growth.html
but how much isn’t 
 shared?

  what isn’t shared?
              who isn’t sharing it?
why not?
     how much does it matt...
you can not manage 
what you do not measure




               http://www.flickr.com/photos/archeon/2941655917/
research questions

  1. Is there benefit for those who share?
  2. Do journal policies increase rates of sharing?
  3. Wha...
http://en.wikipedia.org/wiki/DNA_microarray
   http://en.wikipedia.org/wiki/Image:Heatmap.png
   http://commons.wikimedia....
microarray
      data
1. Is there benefit for 
 those who share?




                 http://www.flickr.com/photos/sunrise/35819369/
currency of value?

     Citations.

           $50!




                     Diamond,Arthur M. What is a Citation Worth?....
Prior work focused on the citation
 advantage of an open access
 publishing model.

Our question: are articles that share
...
dataset
85 cancer microarray trials published in 1999-2003, as
identified by Ntzani and Ioannidis (2003)

citations
ISI Web...
Note:
 log
 scale
In multivariate regression, we found studies
that had made their data publicly available
received 69% more citations than ...
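A percentage such as "69% more citations" is how a regression coefficient is usually reported when the outcome is on a log scale. The sketch below assumes the outcome was the natural log of the citation count (the slides only say "log scale", so the base is an assumption) and shows the conversion between a coefficient and a percent difference. The function names are illustrative, not taken from the study's scripts.

```python
import math


def percent_change_from_log_coef(beta):
    """Percent difference in the outcome implied by a coefficient on a natural-log outcome."""
    return (math.exp(beta) - 1.0) * 100.0


def log_coef_from_percent_change(pct):
    """Inverse conversion: the natural-log coefficient that corresponds to a percent difference."""
    return math.log(1.0 + pct / 100.0)


# Figures reported on the slide: 69% more citations, 95% CI 18% to 143%.
for label, pct in [("estimate", 69), ("CI lower", 18), ("CI upper", 143)]:
    beta = log_coef_from_percent_change(pct)
    print(f"{label}: {pct}% more citations <-> log-scale coefficient {beta:.3f}")
```

Under this assumption the confidence interval is symmetric around the coefficient on the log scale, which is why it looks lopsided (18% to 143%) once converted back to percentages.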
future work
     • collect a larger dataset for citation
       analysis (stay tuned)

     • investigate other datatypes
...
2. Do journal data sharing 
 policies increase sharing?




                 http://www.flickr.com/photos/ryanr/142455033/
“An inherent principle of
 publication is that others
 should be able to replicate and
 build upon the authors'
 published...
Prior work examined data sharing
 policies in biomedicine, but these
 reviews are now dated,
 consider a variety of resour...
Our aim: look at data sharing policies
 within Instruction to Author
 statements of 70 journals, as they
 apply to gene ex...
content of data sharing policies

   Very diverse policies in terms of:
    •   statements of policy motivation
    •   da...
strength of data sharing policies

    No applicable policy (43%)


    Weak policy (24%)
      should, recommend, request...
strength of data sharing policies
multivariate associations
                                         • Biochemistry
     ...
strength of data sharing policies
associated with impact factor
                   High-impact journals
                  ...
data sharing policies
associated with amount of sharing

     For each of the 70 journals,
         we measured the percen...
data sharing policies
associated with amount of sharing

           Having a data-sharing policy?   • Genetics &
        ...
•   our corpus of “gene expression microarray” articles
    may have included some that reused data and did not
    themse...
future work on journal policies

    • use a more precise filter to isolate
      data producing articles and thereby
     ...
future work on funder policies

    • are they effective? (stay tuned)
    • what do people propose in data
      sharing ...
3. What other factors are 
 correlated with sharing 
 and withholding data?




                   http://www.flickr.com/p...
Prior work has focused on surveys and
studies of intention.


Our aim: measure associations between
observed data sharing ...
pilot dataset


  Ochsner et al. manually reviewed 20 journals for 2007:
       400 studies
       200 shared their microa...
pilot variables

                          Journal
  Funder     Journal                     Investigator
                 ...
funder mandates



 NIH 2003 Data Sharing Requirement

 Requires a data sharing plan
 for studies funded after October 200...
funder mandates


 Assumed data sharing requirement was applicable if:
 the NIH grant numbers associated with PubMed entry...
author experience
   Publication history and impact proxy

   First and last authors:
   • years since first paper
   • h-i...
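The h-index used here as an experience proxy is the largest number N such that an author has N papers cited at least N times. A minimal sketch of that definition follows; the citation counts in the example are hypothetical, and the code says nothing about how the study obtained its counts.

```python
def h_index(citation_counts):
    """Return the largest N such that N papers have at least N citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical author with five papers.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))
```

Sorting in descending order means the loop can stop at the first paper whose citation count falls below its rank.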
author experience
Derived h-index (pubmedi citation indices):

 Author publication
 history:

 Author name           Autho...
pilot variables

                          Journal
  Funder     Journal                     Investigator
                 ...
stats

    Univariate odds ratios
    Multivariate logistic regression
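As an illustration of the univariate step, an odds ratio can be computed from a 2x2 table of a binary factor against the sharing outcome. The sketch below uses hypothetical counts, not the pilot's data, and the Wald confidence interval is one common choice rather than necessarily the method used in the study.

```python
import math


def odds_ratio(exposed_shared, exposed_not, unexposed_shared, unexposed_not):
    """Odds of sharing with the factor present divided by odds of sharing with it absent."""
    return (exposed_shared * unexposed_not) / (exposed_not * unexposed_shared)


def wald_confidence_interval(a, b, c, d, z=1.96):
    """Approximate 95% interval for the odds ratio, computed on the log scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * standard_error), math.exp(log_or + z * standard_error)


# Hypothetical counts: studies with and without some factor, split by whether data was shared.
a, b, c, d = 60, 40, 40, 60
print(odds_ratio(a, b, c, d))
print(wald_confidence_interval(a, b, c, d))
```

An odds ratio above 1 with an interval that excludes 1 would suggest the factor is associated with more sharing; the multivariate logistic regression then estimates the same kind of ratio for each factor while adjusting for the others.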
results of pilot
 Not statistically significant             Statistically significant



                                 ...
results of pilot


                   33%
PhD dissertation

  More samples,
  more variables




                   http://www.flickr.com/photos/krcla/2069243613/
More samples:

  Developed and evaluated automated
  methods to:

   • Identify studies that generate datasets that
    co...
To identify studies that generate datasets,

use a query on the full text of published articles:
  ("gene expression" AND ...
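The boolean logic of this full-text query can be sketched as a simple text filter. The terms come from the query in the slides: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*"). The matching itself (case-insensitive, whole-word regular expressions, with the trailing wildcard treated as a prefix match) is an assumption, since a real full-text search engine tokenizes and stems differently. The function and constant names are illustrative.

```python
import re

# Terms taken from the query in the slides.
REQUIRED = ["gene expression", "microarray", "cell", "rna"]
ANY_OF = ["rneasy", "trizol", "real-time pcr"]
EXCLUDED_PREFIXES = ["tissue microarray", "cpg island"]  # trailing * in the query


def _has_term(text, term, prefix=False):
    """Case-insensitive whole-word match; with prefix=True the term may continue (wildcard)."""
    pattern = r"\b" + re.escape(term) + ("" if prefix else r"\b")
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def looks_like_data_producing(full_text):
    """Apply the AND / OR / NOT structure of the query to one article's full text."""
    return (
        all(_has_term(full_text, term) for term in REQUIRED)
        and any(_has_term(full_text, term) for term in ANY_OF)
        and not any(_has_term(full_text, term, prefix=True) for term in EXCLUDED_PREFIXES)
    )


sample = ("We measured gene expression with a microarray. "
          "Each cell sample had RNA extracted using TRIzol.")
print(looks_like_data_producing(sample))
```

Whole-word matching matters here: a plain substring test for "rna" would also match words such as "journal".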
To determine which articles have shared data,

use a query on the full text of published articles:
  pubmed_gds[filter] and...
More variables:

  Use PubMed and a variety of other internet
  resources...
Funder       Journal       Investigator   Institution     Study

funded by     impact         years since   sector        ...
stats

    Univariate odds ratios
    Multivariate logistic regression
    Exploratory factor analysis
results?




           http://www.flickr.com/photos/skrb/2427171774/
research questions

  1. Is there benefit for those who share?
  2. Do journal policies increase rates of sharing?
  3. Wha...
what’s next?
future work previously mentioned...

     • citation analysis of larger cohort
     • journal policies with refined filter
 ...
Reuse.




         http://www.flickr.com/photos/boitabulle/3668162701/
who reuses data?
                  why?
      when?
                     who doesn’t?
 which datasets are most likely 
  t...
One possible reuse research agenda

  1. Inventory reuse acknowledgement patterns
  2. Build full-text and metadata filters ...
Resources

 • GEO list of reuse
   articles (currently 618)
 • Previous work in citation context
   classification
 • Amazo...
Stakeholders
  • readers
  • reusers             For their perspectives,

  • authors           and also to design studies...
Data sharing plan



  I post my data, code, and statistical scripts at
  http://www.dbmi.pitt.edu/piwowar
  Share yours t...
Dept of Biomedical Informatics at U of Pittsburgh
NLM for training grant funding
Open science online community and those w...
“Does anyone want your data?

 That’s hard to predict[…]


 After all, no one ever knocked on your door asking to buy
 tho...
Journal
mandates




           variables
Correlates with self‐reported data 
withholding
            industry involvement
perceived competitiveness of field
       ...
Self‐reported reasons for data 
withholding
               sharing is too much effort
want student or jr faculty to publis...
Prevalence of data withholding 
via surveys
 self-reported denying a request in last 3 years

      trainees self-reported...
Preliminary work and future directions in measuring biomedical research data sharing


NESCent visit: Measuring progress toward a cultural norm of shared (and reused!) biomedical research data

  1. 1. Measuring progress toward a cultural norm of shared (and reused!) biomedical research data Heather Piwowar Department of Biomedical Informatics University of Pittsburgh
  2. 2. Sharing research data http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  4. 4. Sharing research data PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  8. 8. Shared data benefits science Verify Understand Extend Explore Combine Synergize Train Reduce
  9. 9. But... costly for authors Find Organize Document Deidentify Format Decide Ask Submit Answer questions Worry about mistakes being found Worry about data being misinterpreted Worry about being scooped Forgo money and IP and prestige???
  10. 10. As a result, policy makers have spent  lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  11. 11. ... on initiatives, requests,  requirements, and tools NIH data sharing plan requirement Journal requirements Public databases Data sharing grids like BIRN and caBIG Data formatting standards Editorials, letters to the editor, discussion....
  12. 12. http://www.flickr.com/photos/mesh/14102209/
  13. 13. lots of data sharing! http://www.genome.jp/en/db_growth.html
  14. 14. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  15. 15. you can not manage  what you do not measure http://www.flickr.com/photos/archeon/2941655917/
  16. 16. research questions 1. Is there benefit for those who share? 2. Do journal policies increase rates of sharing? 3. What other factors are correlated with sharing and withholding data?
  17. 17. http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG microarray data
  18. 18. microarray data
  19. 19. 1. Is there benefit for  those who share? http://www.flickr.com/photos/sunrise/35819369/
  20. 20. currency of value? Citations. $50! Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
  21. 21. Prior work focused on the citation advantage of an open access publishing model. Our question: are articles that share their raw research data cited more than articles that don’t?
  22. 22. dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). citations: ISI Web of Science Citation index, citations from 2004-2005. data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. statistics: multivariate linear regression.
  23. 23. Note: log scale
  24. 24. In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%) Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
  25. 25. future work • collect a larger dataset for citation analysis (stay tuned) • investigate other datatypes • examine citation context
  26. 26. 2. Do journal data sharing  policies increase sharing? http://www.flickr.com/photos/ryanr/142455033/
  27. 27. “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  28. 28. Prior work examined data sharing policies in biomedicine, but these reviews are now dated, consider a variety of resources, and don’t correlate policy to behaviour. McCain. Science Communication, Vol. 16, No. 4. (1 June 1995), pp. 403-431 NAS. Sharing Publication-Related Data and Materials. (2003), p. 33
  29. 29. Our aim: look at data sharing policies within the Instructions to Authors statements of 70 journals, as they apply to gene expression microarray data.
  30. 30. content of data sharing policies Very diverse policies in terms of: • statements of policy motivation • datatype-specific policies • requested vs. required • data location • data format • data completeness • timeliness of sharing • consequences for not sharing • exceptions
  31. 31. strength of data sharing policies. No applicable policy (43%). Weak policy (24%): "should", "recommend", "request"; "must", but without database accession number. Strong policy (33%): "must", "required", "condition of publication"; requires database accession number.
  32. 32. strength of data sharing policies: multivariate associations. Factors: Impact Factor; Open Access?; Society Publisher?; subject category (• Biochemistry & Molecular Biology • Oncology). Outcome: Journal has a data sharing policy?
  33. 33. strength of data sharing policies associated with impact factor High-impact journals tend to have a strong data-sharing policy
  34. 34. data sharing policies associated with amount of sharing For each of the 70 journals, we measured the percent of articles that were cited from within GEO and ArrayExpress. We considered this a proxy for percent of articles with shared data.
  35. 35. data sharing policies associated with amount of sharing. Factors: Having a data-sharing policy?; Impact Factor; Open Access?; Society Publisher?; subject category (• Genetics & Heredity • Multidisciplinary Sciences). Outcome: % of articles with shared data
  36. 36. • our corpus of “gene expression microarray” articles may have included some that reused data and did not themselves produce primary data • these results should be considered preliminary, pending a more precise filter (stay tuned) http://www.flickr.com/photos/vlastula/300102949/
  37. 37. future work on journal policies • use a more precise filter to isolate data producing articles and thereby understand the absolute levels of data sharing • investigate other datatypes • look at associations with reviewer instructions and opinions
  38. 38. future work on funder policies • are they effective? (stay tuned) • what do people propose in data sharing plans? Do they do what they propose? Why not? • quantify the perceived worth of data sharing plans and accomplishments in funding and promotion decisions
  39. 39. 3. What other factors are  correlated with sharing  and withholding data? http://www.flickr.com/photos/cogdog/123072/
  40. 40. Prior work has focused on surveys and studies of intention. Our aim: measure associations between observed data sharing behaviour and environmental variables Blumenthal et al. Acad Med. 2006 Campbell et al. JAMA. 2002 Kyzas et al. J Natl Cancer Inst. 2005 Vogeli et al. Acad Med. 2006 Reidpath et al. Bioethics 2001
  41. 41. pilot dataset Ochsner et al. manually reviewed 20 journals for 2007: 400 studies 200 shared their microarray data Ochsner et al. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991.
  42. 42. pilot variables: Funder mandates; Journal mandates; Journal impact factor; Investigator “experience”. Outcome: Is research data shared after publication?
  43. 43. funder mandates NIH 2003 Data Sharing Requirement Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
  44. 44. funder mandates Assumed data sharing requirement was applicable if: the NIH grant numbers associated with PubMed entry had $750 000 in total funding any year since 2004 plus a NIH grant number with a leading “1” or “2” since 2004
  45. 45. author experience Publication history and impact proxy First and last authors: • years since first paper • h-index (the largest number N such that an author has N papers cited at least N times) • a-index
  46. 46. author experience Derived h-index (pubmedi citation indices): Author publication history: Author name, via the Author-ity web service. Torvik & Smalheiser. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. Citation counts:
  47. 47. pilot variables: Funder mandates; Journal mandates; Journal impact factor; Investigator “experience”. Outcome: Is research data shared after publication?
  48. 48. stats Univariate odds ratios Multivariate logistic regression
  49. 49. results of pilot (legend: Not statistically significant / Statistically significant). Variables: Funder mandates; Journal mandates; Journal impact factor; Investigator “experience”. Outcome: Is research data shared after publication?
  50. 50. results of pilot 33%
  51. 51. results of pilot
  52. 52. results of pilot
  53. 53. results of pilot
  54. 54. results of pilot
  55. 55. results of pilot
  56. 56. results of pilot
  57. 57. PhD dissertation More samples, more variables http://www.flickr.com/photos/krcla/2069243613/
  58. 58. More samples: Developed and evaluated automated methods to: • Identify studies that generate datasets that could potentially be shared • Determine which of these have in fact been shared
  59. 59. To identify studies that generate datasets, use a query on the full text of published articles: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
  60. 60. To determine which articles have shared data, use a query on the full text of published articles: pubmed_gds[filter] and query ArrayExpress
  61. 61. More variables: Use PubMed and a variety of other internet resources...
  62. 62. Variables by category. Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH? Journal: impact factor; strength of policy; open access?; number of microarray studies published. Investigator: years since first paper; h-index; a-index; previously shared?; previously reused?; gender. Institution: sector; size; impact rank; country. Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year.
  63. 63. stats Univariate odds ratios Multivariate logistic regression Exploratory factor analysis
  64. 64. results? http://www.flickr.com/photos/skrb/2427171774/
  65. 65. research questions 1. Is there benefit for those who share? 2. Do journal policies increase rates of sharing? 3. What other factors are correlated with sharing and withholding data?
  66. 66. what’s next?
  67. 67. future work previously mentioned... • citation analysis of larger cohort • journal policies with refined filter • beyond microarray data • deeper into journal and funder policies • and, finally....
  68. 68. Reuse. http://www.flickr.com/photos/boitabulle/3668162701/
  69. 69. who reuses data? why? when? who doesn’t? which datasets are most likely  to be reused? how many datasets could be  reused but aren’t? why aren’t they? what can we do  about it?
  70. 70. One possible reuse research agenda 1. Inventory reuse acknowledgement patterns 2. Build full-text and metadata filters to identify instances of data reuse 3. Analyze patterns in data reuse choices 4. Survey data producers and data consumers to augment with intentions and perspectives
  71. 71. Resources • GEO list of reuse articles (currently 618) • Previous work in citation context classification • Amazon Mechanical Turk for annotation • Experimental Philosophy for insight into cultural norms • ... Teufel et al. (2006) Automatic classification of citation function. EMNLP.
  72. 72. Stakeholders • readers • reusers • authors • editors • reviewers • funders • database designers, maintainers, curators • patients, subjects, or populations. For their perspectives, and also to design studies that have actionable results for these groups
  73. 73. Data sharing plan I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar Share yours too! http://www.flickr.com/photos/myklroventine/892446624/
  74. 74. thank you: Dept of Biomedical Informatics at U of Pittsburgh; NLM for training grant funding; Open science online community and those who release their articles, datasets and photos openly; Dr Wendy Chapman for her support and feedback
  75. 75. “Does anyone want your data? That’s hard to predict[…] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.” Got data? Nature Neuroscience 10, 931 (2007)
  76. 76. Journal mandates variables
  77. 77. Correlates with self-reported data withholding (bar chart, axis 0 to 3): industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. Blumenthal et al. Acad Med. 2006
  78. 78. Self-reported reasons for data withholding (bar chart, axis 0% to 80%): sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. Campbell et al. JAMA 2002.
  79. 79. Prevalence of data withholding via surveys (bar chart, axis 0% to 40%): self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data. Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
