Public data archiving: Who shares? Who doesn’t? What can we do about it?

Heather Piwowar
Presentation at UBC Biodiversity Internal Seminar Series (BLISS)
http://www.zoology.ubc.ca/~biodiv/BLISS/BLISS.htm

Published in: Education
Public data archiving: Who does? Who doesn't? What can we do about it?

  1. 1. Public data archiving: Who shares? Who doesn’t? What can we do about it? Heather Piwowar Presented at UBC BLISS, Sept 2010 DataONE postdoc with Dryad and NESCent, @UBC PhD in Dept of Biomedical Informatics, U of Pittsburgh
  2. 2. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
  3. 3. http://www.flickr.com/photos/jsmjr/62443357/
  4. 4. http://www.flickr.com/photos/camilleharrington/3587294608/
  5. 5. http://www.flickr.com/photos/rkuhnau/3318245976/
  6. 6. http://www.flickr.com/photos/conformpdx/1796399674/
  7. 7. http://www.flickr.com/photos/rkuhnau/3317418699/
  8. 8. http://www.flickr.com/photos/zemlinki/261617721/
  9. 9. http://www.flickr.com/photos/tracenmatt/3020786491/
  10. 10. http://www.flickr.com/photos/the-o/2078239333/
  11. 11. http://www.flickr.com/photos/ryanr/142455033/
  12. 12. http://www.flickr.com/photos/75166820@N00/5318468/
  13. 13. Find; Organize; Document; Deidentify; Format; Decide; Ask; Submit; Answer questions; Worry about mistakes being found; Worry about data being misinterpreted; Worry about being scooped; Forgo money and IP and prestige???
  14. 14. not very motivating.
  15. 15. As a result, policy makers have spent lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  16. 16. building databases, developing standards, articulating best practices to support public archiving of research datasets
  17. 17. lots of data sharing! http://www.genome.jp/en/db_growth.html
  18. 18. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  19. 19. you cannot manage what you do not measure (quote: Lord Kelvin) http://www.flickr.com/photos/archeon/2941655917/
  20. 20. As we seek to embrace and encourage data sharing, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify best practices and opportunities for improvement.
  21. 21. research questions 1. Is there benefit for those who share? 2. How can we study data sharing behaviour in a scalable, systematic way? 3. What factors are correlated with sharing and withholding data?
  22. 22. http://www.flickr.com/photos/paulhami/1020538523//
  23. 23. Which data? http://www.flickr.com/photos/paulhami/1020538523//
  24. 24. Where? http://www.flickr.com/photos/paulhami/1020538523//
  25. 25. With whom? http://www.flickr.com/photos/paulhami/1020538523//
  26. 26. When? http://www.flickr.com/photos/paulhami/1020538523//
  27. 27. Under what terms? http://www.flickr.com/photos/paulhami/1020538523//
  28. 28. http://www.flickr.com/photos/paulhami/1020538523//
  29. 29. http://www.flickr.com/photos/paulhami/1020538523//
  30. 30. • gene expression microarray data • raw intensity data • upon publication • publicly on the internet • (centralized databases) http://www.flickr.com/photos/paulhami/1020538523//
  31. 31. http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG microarray data
  32. 32. microarray data
  33. 33. 1. Is there benefit for those who share? http://www.flickr.com/photos/sunrise/35819369/
  34. 34. currency of value? Citations.
  35. 35. currency of value? Citations. $50! Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
  36. 36. dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). citations: ISI Web of Science Citation Index, citations from 2004-2005. data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. statistics: multivariate linear regression.
  37. 37. Note: log scale
  38. 38. ~70%
  39. 39. 2. Need automated methods to: a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attributes about the environment
  40. 40. a) Identify studies that create datasets http://www.flickr.com/photos/lofaesofa/248546821/
  41. 41. Look for wetlab methods in article full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
  42. 42. Combined, these full-text portals reach 85% of the articles available through U of Pittsburgh library subscriptions.
  43. 43. But how to generate an effective query? Use open access articles.
  44. 44. • text analysis: automatically catalogued single words and word-pairs from full text • assessed precision and recall • combined the high performers:
  45. 45. Derived query: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
  46. 46. Evaluation: Ochsner et al. Nature Methods (2008); 400 studies across 20 journals. Precision: 90% (conf int: 86% to 93%). Recall: 56% (conf int: 52% to 61%).
  47. 47. a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attributes about the environment
  48. 48. b) Determine which datasets have in fact been shared
  49. 49. 77%
  50. 50. a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attributes about the environment
  51. 51. Funder Journal Investigator Institution Study Is research data shared after publication?
  52. 52. Variables considered, by level:
       Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
       Journal: impact factor; strength of policy; open access?; number of microarray studies published
       Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender
       Institution: sector; size; impact rank; country
       Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
  53. 53. journal rank
  54. 54. journal data sharing policy “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  55. 55. institution rank Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
  56. 56. study type
  57. 57. author “experience” Author publication history: Author name, Author-ity web service. Torvik & Smalheiser (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. Citation counts:
  58. 58. author gender
  59. 59. funding level PubMed grant lists + NIH grant details
  60. 60. funder mandates Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
  61. 61. funder mandates Proxy for NIH data sharing policy applicability: If in any year since 2004, • funded by an NIH grant number with a “1” or “2” type code • received more than $750 000 in total funding from the grant
  62. 62. and so on... 124 variables
  63. 63. Now equipped with automated methods to: a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attributes about the environment
  64. 64. 3.  What factors are correlated  with sharing and withholding  data? http://www.flickr.com/photos/cogdog/123072/
  65. 65. 11,603 datapoints; 25% had links from datasets in databases
  66. 66. univariate analysis
  67. 67. [Figure: Proportion of articles with shared datasets, by year. y-axis: proportion of articles with datasets found in GEO or ArrayExpress (0.05 to 0.35); x-axis: year article published (2000 to 2009).]
  68. 68. [Figure: Proportion of datasets shared (0.0 to 1.0), by journal. Journals shown: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun. Annotation: (Physiological Genomics).]
  69. 69. [Figure: Proportion of datasets shared (0.0 to 1.0), by institution. Institutions shown: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku. Annotation: (Stanford).]
  70. 70. [Figure: Proportion of datasets shared (0.0 to 1.0), by institution rank (x-axis ticks from 1 to 1901).]
  71. 71. multivariate analysis
  72. 72. factor analysis
  73. 73. multivariate logistic regression over the first-order factors
  74. 74. [Figure: Multivariate nonlinear regressions with interactions. x-axis: odds ratio (0.25 to 8.00, log scale); annotation: 0.95. Factors shown: Has journal policy; Count of R01 & other NIH grants; Authors prev GEOAE sharing & OA & microarray creation; NO K funding or P funding; Journal impact; Institution high citations & collaboration; Journal policy consequences & long halflife; NOT animals or mice; Institution is government & NOT higher ed; Last author num prev pubs & first year pub; Large NIH grant; Humans & cancer; NO geo reuse + YES high institution output; First author num prev pubs & first year pub.]
  75. 75. [Figure: Multivariate nonlinear regressions with interactions (same odds-ratio plot and factors as slide 74).]
  76. 76. logistic regression using second-order factors
  77. 77. [Figure: Multivariate nonlinear regression with interactions. x-axis: odds ratio (0.25 to 4.00, log scale); annotation: 0.95. Factors shown: OA journal & previous GEO-AE sharing; Amount of NIH funding; Journal impact factor and policy; Higher Ed in USA; Cancer & humans.]
  78. 78. [Figure: Multivariate nonlinear regression with interactions (same odds-ratio plot and factors as slide 77).]
  79. 79. Conclusions: • data sharing rates are increasing, but overall levels are low Preliminary evidence: • levels are particularly low in cancer • levels are highest for those who • publish in a journal with a policy • publish in an open access journal • have shared data before
  80. 80. • data and filters were imperfect • many assumptions • didn’t capture all types of sharing • don’t know how generalizable across datatypes • should be considered hypothesis-generating http://www.flickr.com/photos/vlastula/300102949/
  81. 81. http://www.flickr.com/photos/gatewaystreets/3838452287/
  82. 82. NSF-funded distributed framework and cyberinfrastructure for environmental science. Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields. The National Evolutionary Synthesis Center, NSF-funded: • Duke University, • UNC at Chapel Hill • North Carolina State University
  83. 83. 1.  new domain
  84. 84. http://www.flickr.com/photos/paulhami/1020538523//
  85. 85. http://www.flickr.com/photos/paulhami/1020538523//
  86. 86. • evolution and ecology datasets • raw data that support results • upon publication or short embargo • publicly on the internet http://www.flickr.com/photos/paulhami/1020538523//
  87. 87. challenges! 1. No PubMed 2. Diverse data types, norms, repositories 3. Data almost always collected for a specific hypothesis 4. Less public sharing so far
  88. 88. 2.  new initiatives
  89. 89. JDAP • The American Naturalist • Evolution • Journal of Evolutionary Biology • Molecular Ecology • Evolutionary Applications • Genetics • Heredity • Molecular Biology and Evolution • Systematic Biology • Paleobiology • BMC Evolutionary Biology
  90. 90. Blumenthal et al. Acad Med. 2006 Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001. http://www.flickr.com/photos/jima/606588905/
  91. 91. 3.  Reuse. http://www.flickr.com/photos/boitabulle/3668162701/
  92. 92. who reuses data? why? when? who doesn’t? which datasets are most likely to be reused? how many datasets could be reused but aren’t? why aren’t they? does it matter? what can we do about it?
  93. 93. http://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Gamma_distribution_pdf.svg/500px-Gamma_distribution_pdf.svg.png
  94. 94. I post my data, code, and statistical scripts on GitHub (links from http://researchremix.org) Share yours too! http://www.flickr.com/photos/myklroventine/892446624/
  95. 95. “Does anyone want your data? That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.” Got data? Nature Neuroscience (2007)
  96. 96. Dept of Biomedical Informatics at U of Pittsburgh Wendy Chapman for support and feedback Todd Vision, Mike Whitlock for ongoing discussions NIH NLM. NSF through DataONE, NESCent, Dryad. Open science online community and those who release their articles, datasets and photos openly thank you
  97. 97. http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/
  98. 98. Journal mandates variables
  99. 99. perspectives, and also driving towards actionable results for these groups: • readers • reusers • authors • editors • reviewers • funders • database designers, maintainers, curators • patients, subjects, or populations
  100. 100. http://www.flickr.com/photos/sunrise/35819369/ http://www.flickr.com/photos/fboyd/2156630044/
  101. 101. Correlates with self-reported data withholding: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. [Bar chart, x-axis 0 to 3.] Blumenthal et al. Acad Med. 2006
  102. 102. Self-reported reasons for data withholding: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. [Bar chart, x-axis 0% to 80%.] Campbell et al. JAMA 2002.
  103. 103. Table 2: Second-order factor loadings, by first-order factors
       Amount of NIH funding: 0.88 Count of R01 & other NIH grants; 0.49 Large NIH grant; -0.55 NO K funding or P funding
       Cancer & humans: 0.83 Humans & cancer
       OA journal & previous GEO-AE sharing: 0.59 Authors prev GEOAE sharing & OA & microarray creation; 0.43 Institution high citations & collaboration; 0.31 First author num prev pubs & first year pub; -0.36 Last author num prev pubs & first year pub
       Journal impact factor and policy: 0.57 Journal impact; 0.51 Last author num prev pubs & first year pub
       Higher Ed in USA: 0.40 NO geo reuse + YES high institution output; -0.44 Institution is government & NOT higher ed
  104. 104. Table 3: Second-order factor loadings, by original variables
       OA journal & previous GEO-AE sharing: 0.40 first.author.num.prev.geoae.sharing.tr; 0.37 pubmed.is.open.access; 0.37 first.author.num.prev.oa.tr; 0.35 last.author.num.prev.geoae.sharing.tr; 0.32 pubmed.is.effectiveness; 0.32 last.author.num.prev.oa.tr; 0.31 pubmed.is.geo.reuse; -0.38 country.japan
       Amount of NIH funding: 0.87 nih.cumulative.years.tr; 0.85 num.grants.via.nih.tr; 0.84 max.grant.duration.tr; 0.82 num.grant.numbers.tr; 0.80 pubmed.is.funded.nih; 0.79 nih.max.max.dollars.tr; 0.70 nih.sum.avg.dollars.tr; 0.70 nih.sum.sum.dollars.tr; 0.59 has.R.funding; 0.59 num.post2003.morethan500k.tr; 0.58 country.usa; 0.58 has.U.funding; 0.57 has.R01.funding; 0.55 num.post2003.morethan750k.tr; 0.53 has.T.funding; 0.53 num.post2003.morethan1000k.tr; 0.49 num.post2004.morethan500k.tr; 0.45 num.post2004.morethan750k.tr; 0.44 has.P.funding; 0.43 num.post2004.morethan1000k.tr; 0.43 num.nih.is.nci.tr; 0.35 num.post2005.morethan500k.tr; 0.32 num.nih.is.nigms.tr; 0.31 num.post2005.morethan750k.tr; 0.31 last.author.num.prev.pubs.tr
       Journal impact factor and policy: 0.48 journal.impact.factor.log; 0.47 jour.policy.requires.microarray.accession; 0.46 jour.policy.mentions.exceptions; 0.46 pubmed.num.cites.from.pmc.tr; 0.45 journal.5yr.impact.factor.log; 0.45 jour.policy.contains.word.miame.mged; 0.42 last.author.num.prev.pmc.cites.tr; 0.41 jour.policy.requests.accession; 0.40 journal.immediacy.index.log; 0.40 journal.num.articles.2008.tr; 0.39 years.ago.tr; 0.36 jour.policy.says.must.deposit; 0.35 pubmed.num.cites.from.pmc.per.year; 0.33 institution.mean.norm.citation.score; 0.32 last.author.year.first.pub.ago.tr; 0.31 country.usa; 0.31 jour.policy.contains.word.microarray; -0.31 pubmed.is.open.access
       Cancer & humans: 0.60 pubmed.is.cancer; 0.59 pubmed.is.humans; 0.52 pubmed.is.cultured.cells; 0.43 pubmed.is.core.clinical.journal; 0.39 institution.is.medical; -0.58 pubmed.is.plants; -0.50 pubmed.is.fungi; -0.37 pubmed.is.shared.other; -0.30 pubmed.is.bacteria
       Higher Ed in USA: 0.36 institution.stanford; 0.36 institution.is.higher.ed; 0.35 country.usa; 0.35 has.R.funding; 0.33 has.R01.funding; 0.30 institution.harvard; -0.37 institution.is.govnt
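
The derived query on slide 45 is a boolean filter over article full text, and its logic can be sketched in a few lines of Python. This is only an illustration, not the talk's actual tooling: the helper names (`contains`, `matches_derived_query`) are hypothetical, and the matching rules (case-insensitive whole words, a trailing `*` matching any word ending) are assumptions, since the slides do not say how each full-text portal tokenizes or stems its queries.

```python
import re

# Terms copied from the derived query on slide 45.
REQUIRED = ["gene expression", "microarray", "cell", "rna"]   # all must appear
ANY_OF = ["rneasy", "trizol", "real-time pcr"]                # at least one must appear
EXCLUDED = ["tissue microarray*", "cpg island*"]              # none may appear


def contains(text, term):
    """Case-insensitive whole-word match; a trailing '*' allows any word ending.

    Assumption: real search portals may stem or tokenize differently.
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_derived_query(full_text):
    """Apply (all REQUIRED) AND (any ANY_OF) NOT (any EXCLUDED)."""
    return (
        all(contains(full_text, term) for term in REQUIRED)
        and any(contains(full_text, term) for term in ANY_OF)
        and not any(contains(full_text, term) for term in EXCLUDED)
    )


if __name__ == "__main__":
    methods = ("Total RNA was isolated with TRIzol from each cell line and "
               "hybridized to a microarray to measure gene expression.")
    print(matches_derived_query(methods))
    print(matches_derived_query(methods + " A tissue microarray was also stained."))
```

Keeping the three term lists separate from the matching function makes it easy to swap in other high-performing terms found by the text analysis on slide 44.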
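
Slide 46 reports precision and recall with confidence intervals. A minimal sketch of how such figures can be computed from a confusion count follows. The slides do not state which interval method was used, so the Wilson score interval here is an assumption, and the counts in the usage example are made up for illustration rather than taken from the evaluation.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def precision_recall(true_pos, false_pos, false_neg):
    """Return {metric: (estimate, lower, upper)} for precision and recall."""
    retrieved = true_pos + false_pos   # articles the query returned
    relevant = true_pos + false_neg    # articles that really created datasets
    return {
        "precision": (true_pos / retrieved, *wilson_interval(true_pos, retrieved)),
        "recall": (true_pos / relevant, *wilson_interval(true_pos, relevant)),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    for name, (estimate, low, high) in precision_recall(90, 10, 60).items():
        print(f"{name}: {estimate:.2f} (conf int: {low:.2f} to {high:.2f})")
```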
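
The proxy on slide 61 for whether the NIH data sharing policy applies can be written as a small predicate. The sketch below is one reading of that slide, under stated assumptions: the two bullet conditions must both hold for the same grant in the same year, "since 2004" includes 2004, and the type code is the leading digit of the grant number. The `GrantYear` record and its field names are hypothetical, not part of the original dataset.

```python
from dataclasses import dataclass


@dataclass
class GrantYear:
    """One NIH grant in one fiscal year (hypothetical record layout)."""
    fiscal_year: int
    type_code: str        # leading digit of the grant number, e.g. "1" or "2"
    total_dollars: float  # total funding received from the grant that year


def policy_likely_applies(grant_years, first_year=2004, threshold=750_000):
    """Proxy from slide 61: any qualifying grant-year triggers the flag.

    Assumption: both conditions are evaluated on the same grant-year record.
    """
    return any(
        g.fiscal_year >= first_year
        and g.type_code in {"1", "2"}
        and g.total_dollars > threshold
        for g in grant_years
    )


if __name__ == "__main__":
    history = [
        GrantYear(2003, "1", 900_000),  # large, but before 2004
        GrantYear(2005, "5", 900_000),  # large, but not a type 1 or 2 award
        GrantYear(2006, "2", 800_000),  # qualifies
    ]
    print(policy_likely_applies(history))
```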
