NEDCC 2010 Piwowar Leaders and Laggards

"Leaders and Laggards in the preservation of raw biomedical research data" presented at NEDCC 2010, The Tectonics of Digital Curation
A Symposium on the Shifting Preservation and Access Landscape


NEDCC 2010 Piwowar Leaders and Laggards

  1. 1. Leaders and Laggards in the preservation of raw biomedical research data. Heather Piwowar, Department of Biomedical Informatics, University of Pittsburgh. Soon-to-be Postdoctoral Associate with Data Observation Network for Earth (DataONE)
  2. 2. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
  3. 3. http://www.flickr.com/photos/jsmjr/62443357/
  4. 4. http://www.flickr.com/photos/camilleharrington/3587294608/
  5. 5. http://www.flickr.com/photos/rkuhnau/3318245976/
  6. 6. http://www.flickr.com/photos/conformpdx/1796399674/
  7. 7. http://www.flickr.com/photos/rkuhnau/3317418699/
  8. 8. http://www.flickr.com/photos/zemlinki/261617721/
  9. 9. http://www.flickr.com/photos/tracenmatt/3020786491/
  10. 10. http://www.flickr.com/photos/the-o/2078239333/
  11. 11. Researchers have a choice http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  12. 12. Researchers have a choice http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  13. 13. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  14. 14. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  15. 15. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  16. 16. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  17. 17. Shared data benefits science • Verify • Understand • Extend • Explore • Combine • Synergize • Train • Reduce
  18. 18. But... costly for authors • Find • Organize • Document • Deidentify • Format • Decide • Ask • Submit • Answer questions • Worry about mistakes being found • Worry about data being misinterpreted • Worry about being scooped • Forgo money and IP and prestige???
  19. 19. http://www.flickr.com/photos/75166820@N00/5318468/
  20. 20. As a result, policy makers have spent lots of time and money ... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  21. 21. ... on initiatives, requests, requirements, and tools • Funder requirements • Journal requirements • Public databases • Data sharing grids • Data formatting standards • Peer encouragement in editorials, letters to the editor...
  22. 22. Does it work? http://www.flickr.com/photos/archeon/2941655917/
  23. 23. lots of data sharing! http://www.genome.jp/en/db_growth.html
  24. 24. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  25. 25. http://www.flickr.com/photos/paulhami/1020538523//
  26. 26. who what when where why how http://www.flickr.com/photos/ryanr/142455033/
  27. 27. Who to share data with? • everyone on the internet • “qualified” researchers for “qualified” research projects • friends • your lab
  28. 28. What data is shared? • everything • all the datapoints • all the research notes • code • just what is needed to reproduce the results in the paper • raw? cleaned? every processing step?
  29. 29. When is the data shared? • upon collection • upon submission for publication • upon publication • time-embargo after publication • upon retirement or death
  30. 30. Where is it deposited? • centralized datatype specific repositories • journal supplementary information • institutional repositories • disciplinary repositories
  31. 31. Why share it?
  32. 32. How to share it? • massive datasets • syntactic format • semantic format • sensitive data (privacy, endangered species locations, security-related, ...) • what license or community norm
  33. 33. http://www.flickr.com/photos/paulhami/1020538523//
  34. 34. http://www.flickr.com/photos/paulhami/1020538523//
  35. 35. • biomedical data • few privacy concerns • raw data (not images or processed) • openly on the internet • upon publication • datasets are large but manageable • datatypes with mature standards for semantics, syntax, locations http://www.flickr.com/photos/paulhami/1020538523//
  36. 36. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  37. 37. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  38. 38. Data sharing frequency depends on how you ask [chart values: 10%, 25-40%] Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
  39. 39. Data sharing frequency depends on datatype [bar chart: DNA sequences, gene expression microarrays, proteomics spectra; axis 0% to 100%] Noor et al. PLoS Biology 2006. Ochsner et al. Nature Methods 2008. Piwowar et al. PLoS ONE 2007. Editorial. Nature Biotech 2007.
  40. 40. Data sharing frequency depends on when the data was published [chart: sharing frequency, 0% to 40%, by publication year, 2000 to 2009]
  41. 41. lots of data sharing! http://www.genome.jp/en/db_growth.html
  42. 42. Data sharing frequency depends on when the data was published [chart: sharing frequency, 0% to 40%, by publication year, 2000 to 2009]
  43. 43. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  44. 44. http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG microarray data
  45. 45. Funder Journal Investigator Institution Study How often was research data shared upon publication?
  46. 46. How often was research data shared upon publication? = (number of studies that share their data) / (number of studies that create data)
  47. 47. How often was research data shared upon publication? = (number of studies that share their data) / (number of studies that create data)
  48. 48. http://www.flickr.com/photos/lofaesofa/248546821/
  49. 49. Look for wetlab methods in full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
  50. 50. Query the full text of published articles: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
  51. 51. How often was research data shared upon publication? = (number of studies that share their data) / (number of studies that create data)
  52. 52. Querying databases for citation links to data creation studies
  53. 53. How often was research data shared upon publication? = (number of studies that share their data) / (number of studies that create data)
  54. 54. Results: 11,603 studies that create data; we found shared datasets for 25%
  55. 55. Data sharing frequency depends on when the data was published [chart: sharing frequency, 0% to 40%, by publication year, 2000 to 2009]
  56. 56. Funder Journal Investigator Institution Study
  57. 57. • Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH? • Journal: impact factor; strength of policy; open access?; number of microarray studies published • Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender • Institution: sector; size; impact rank; country • Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
  58. 58. study type
  59. 59. author “experience” • Author publication history: author name disambiguation via the Author-ity web service. Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. • Citation counts:
  60. 60. author gender
  61. 61. institution rank Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
  62. 62. funding level PubMed grant lists + NIH grant details
  63. 63. funder mandates Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
  64. 64. journal mandates “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  65. 65. and so on... 124 variables
  66. 66. stats • Univariate proportions • Factor analysis • Logistic regression • Second-order factor analysis • More logistic regression
  67. 67. Proportion of datasets shared, by journal [bar chart, axis 0.0 to 1.0; journals in chart order]: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun
  68. 68. Proportion of datasets shared, by institution [bar chart, axis 0.0 to 1.0; institutions in chart order]: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
  69. 69. Proportion of datasets shared, by institution [bar chart, axis 0.0 to 1.0; institutions in chart order]: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
  70. 70. Proportion of datasets shared, by institution rank [chart: proportion shared, 0.0 to 1.0, against institution rank, 1 to 1901]
  71. 71. Multivariate nonlinear regressions with interactions [chart: odds ratio for each factor; axis ticks 0.25, 0.50, 1.00, 2.00, 4.00, 8.00; label 0.95] • Has journal policy • Count of R01 & other NIH grants • Authors prev GEOAE sharing & OA & microarray creation • NO K funding or P funding • Journal impact • Journal policy consequences & long halflife • Institution high citations & collaboration • NOT animals or mice • Institution is government & NOT higher ed • Last author num prev pubs & first year pub • Large NIH grant • Humans & cancer • NO geo reuse + YES high institution output • First author num prev pubs & first year pub
  72. 72. [chart: odds ratios; axis ticks 0.25, 0.50, 1.00, 2.00, 4.00; label 0.95] • OA journal & previous GEO-AE sharing • Amount of NIH funding • Journal impact factor and policy • Higher Ed in USA • Cancer & humans
  73. 73. • association not causation • lots of assumptions • don’t know how generalizable it is • hypothesis-generating http://www.flickr.com/photos/vlastula/300102949/
  74. 74. what isn’t shared? who isn’t sharing it? • those studying cancer • on human patient data • in journals with few data sharing policies (clinical journals) • labs with fewer funding sources • ...
  75. 75. (what is shared? who is sharing it?) • investigators who have shared before • investigators who publish in open access journals • from Stanford • in Physiological Genomics • ...
  76. 76. Take home • current data repositories are not representative of all data generated • they are missing some of the good stuff • Good news: this is actionable; learn from the leaders and focus on the laggards
  77. 77. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  78. 78. http://www.flickr.com/photos/jima/606588905/
  79. 79. Withholding is associated with industry links, competitiveness • industry involvement • perceived competitiveness of field [chart axis: 0 to 3] 40% of surveyed scientists said data sharing was discouraged during their training! Blumenthal et al. Acad Med. 2006
  80. 80. Withhold because too much effort, desire for continued publishing • sharing is too much effort • want student or jr faculty to publish more • they themselves want to publish more • cost • industrial sponsor • confidentiality • commercial value of results [bar chart, axis 0% to 80%] Campbell et al. JAMA 2002.
  81. 81. Comments show desire for control ‘Before I send you the data could I ask what you want it for?’ ‘Can you be more explicit, please, about the analyses you have in mind and what you plan to do with them?’ ‘We'll have to discuss your request with the other coauthors. Before we do that, I'd like to know your proposed analysis plan.’ ‘We are not finished using the data, but when we are finished with it, we would be open to requests for the data.’ ‘Any use of the data other than for the specific purpose laid down in the contract of collaboration is effectively ruled out.’ Reidpath et al. Bioethics 2001.
  82. 82. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  83. 83. Estimating societal benefit • assume each database hit saves $0.10, or a fraction of data collection costs • assume the value is approximated by the (idealized) funding target for data maintenance: 20-25% of the cost of generating the data. Remembering, moreover, that the indirect benefits are much higher than the direct ones. Ball et al. Nature Biotechnol. 2004.
  84. 84. Number of stakeholders Foster et al. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nature Reviews Genetics 8, 633-639
  85. 85. Impact on training. Survey of doctoral students and postdocs: 23.0% had been denied access to information, data, materials, or programming associated with published research. 28-50% reported that withholding caused negative effects on these aspects of their training: • progress of their research, • rate of discovery in their lab/research group, • quality of their relationships with academic scientists, • quality of their education, • level of communication in their lab/research group. Vogeli et al. Acad Med. 2006 Feb; 81(2):128-36
  86. 86. More research needs to be done!
  87. 87. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  88. 88. Look to the leaders and laggards • Stanford • Physiological Genomics • cancer data • human data • those who haven’t shared before
  89. 89. http://www.flickr.com/photos/sunrise/35819369/
  90. 90. Measuring personal benefit: increased citations Gleditsch et al. Int Studies Perspectives. 2003. Piwowar et al. PLoS ONE. 2007.
  91. 91. 70% more citations Piwowar et al. PLoS ONE. 2007.
  92. 92. What would make it easier? help and straightforward guidelines • more funder time and money • help with confidentiality issues • on-site help • more training • better guidelines • better tools • simpler requirements • less staff turn-over [bar chart, axis 0% to 75%] Hedstrom et al. IASSIST 2006.
  93. 93. What would make it easier? help and straightforward guidelines • more funder time and money • help with confidentiality issues • on-site help • more training • better guidelines • better tools • simpler requirements • less staff turn-over [bar chart, axis 0% to 75%] Hedstrom et al. IASSIST 2006.
  94. 94. What would make it easier? help and straightforward guidelines • more funder time and money • help with confidentiality issues • on-site help • more training • better guidelines • better tools • simpler requirements • less staff turn-over [bar chart, axis 0% to 75%] Hedstrom et al. IASSIST 2006.
  95. 95. Incentives to share: perceived value, mandates, recognition as publication • if I thought it would really benefit others • if required for future funding • if required for publication • if deposits counted as a publication • if citations to data were valued • if monetary compensation [bar chart, axis 0% to 75%] Hedstrom. Society of Am Archivists Ann Meeting. 2008.
  96. 96. Incentives to share: perceived value, mandates, recognition as publication • if I thought it would really benefit others • if required for future funding • if required for publication • if deposits counted as a publication • if citations to data were valued • if monetary compensation [bar chart, axis 0% to 75%] Hedstrom. Society of Am Archivists Ann Meeting. 2008.
  97. 97. http://www.flickr.com/photos/gatewaystreets/3838452287/
  98. 98. • #oa.data • Science Commons • DataCite • Dataverse • MGED • Open Notebook Science • Friendfeed • Nature editorials • many others...
  99. 99. DataONE: NSF-funded distributed framework and cyberinfrastructure for environmental science. Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields. The National Evolutionary Synthesis Center, NSF-funded: • Duke University • UNC at Chapel Hill • North Carolina State University
  100. 100. http://www.flickr.com/photos/g_kat26/4255119413/
  101. 101. Begin to investigate reuse http://www.flickr.com/photos/boitabulle/3668162701/
  102. 102. who reuses data? why? when? who doesn’t? which datasets are most likely to be reused? how many datasets could be reused but aren’t? why aren’t they? what can we do about it? what should we do about it?
  103. 103. I share my code and data at http://www.researchremix.org Sharing data is not easy. Some is better than none. Be the change you want to see. http://www.flickr.com/photos/myklroventine/892446624/
  104. 104. thank you • Dept of Biomedical Informatics at U of Pittsburgh • NLM for training grant funding • Open science online community and those who release their articles, datasets and photos openly • NEDCC
  105. 105. Once shared, always there?
  106. 106. Data contacts and storage decay with time • URL decay: • email decay: • Supplementary information in 6 top journals: 5% unavailable after 2 years, 10% unavailable after 5 years. Evangelou et al. FASEB J. 2006. Wren. Bioinformatics 2008. Wren et al. EMBO Rep 2006.
  107. 107. Benefits both societal and personal • saves other people effort • for the public good • will be cited and enhance my reputation • saves me effort in answering questions • saves me effort in managing my data [bar chart, axis 0% to 80%] Hedstrom et al. IASSIST 2006.
  108. 108. http://www.flickr.com/photos/sunrise/35819369/ http://www.flickr.com/photos/fboyd/2156630044/
  109. 109. http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/
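
The full-text query on slide 50 can be read as a simple boolean filter over article text. The following is a minimal sketch in Python, not the search engine used in the study: it assumes whole-word, case-insensitive matching and treats the trailing `*` in the query as a prefix wildcard. The names `matches_query` and `_compile` are illustrative assumptions.

```python
import re

# Terms copied from the slide 50 query; the grouping mirrors its AND / OR / NOT structure.
REQUIRED_TERMS = ["gene expression", "microarray", "cell", "rna"]
ANY_OF_TERMS = ["rneasy", "trizol", "real-time pcr"]
EXCLUDED_PREFIXES = ["tissue microarray", "cpg island"]  # written with a trailing * in the query


def _compile(term, prefix=False):
    """Build a whole-word, case-insensitive pattern (assumed matching rule).

    With prefix=True the last word may continue, e.g. "cpg island" also
    matches "CpG islands".
    """
    body = r"\s+".join(re.escape(word) for word in term.split())
    tail = r"\w*" if prefix else r"\b"
    return re.compile(r"\b" + body + tail, re.IGNORECASE)


REQUIRED = [_compile(t) for t in REQUIRED_TERMS]
ANY_OF = [_compile(t) for t in ANY_OF_TERMS]
EXCLUDED = [_compile(t, prefix=True) for t in EXCLUDED_PREFIXES]


def matches_query(full_text):
    """Return True if the text satisfies (all required) AND (any of) NOT (any excluded)."""
    has_required = all(p.search(full_text) for p in REQUIRED)
    has_any = any(p.search(full_text) for p in ANY_OF)
    has_excluded = any(p.search(full_text) for p in EXCLUDED)
    return has_required and has_any and not has_excluded


# Hypothetical methods sentence, for illustration only.
example = ("Total RNA was isolated with TRIzol from each cell line "
           "and hybridized to a gene expression microarray.")
print(matches_query(example))
```

A real full-text index tokenizes and stems differently, so this filter only illustrates the logic of the query, not the recall or precision of the actual search.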

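The sharing-rate fraction on slides 46 to 54 and the odds ratios on slides 71 and 72 can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the analysis from the talk: the slides report multivariate logistic regressions, whereas the function below computes a simple unadjusted odds ratio from a 2x2 table. The function names and all counts in the usage lines are hypothetical.

```python
def sharing_rate(studies_sharing, studies_creating):
    """Fraction of data-creating studies that shared their data."""
    if studies_creating <= 0:
        raise ValueError("studies_creating must be positive")
    return studies_sharing / studies_creating


def odds_ratio(shared_with, not_shared_with, shared_without, not_shared_without):
    """Unadjusted odds ratio of sharing for studies with a factor vs. without it.

    An odds ratio above 1 means sharing is more common when the factor is
    present (e.g. the journal has a data sharing policy).
    """
    if not_shared_with == 0 or shared_without == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (shared_with * not_shared_without) / (not_shared_with * shared_without)


# Hypothetical counts, for illustration only (not data from the talk).
print(sharing_rate(25, 100))
print(odds_ratio(40, 60, 25, 75))
```

An odds ratio from a single 2x2 table ignores the other variables; the regression on the slides estimates each factor's association while accounting for the rest.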