NEDCC 2010 Piwowar Leaders and Laggards

"Leaders and Laggards in the preservation of raw biomedical research data" presented at NEDCC 2010, The Tectonics of Digital Curation
A Symposium on the Shifting Preservation and Access Landscape

Transcript

  • 1. Leaders and Laggards in the preservation of raw biomedical research data Heather Piwowar Department of Biomedical Informatics University of Pittsburgh Soon‐to‐be Postdoctoral Associate with  Data Observation Network for Earth (DataONE)  
  • 2. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
  • 3. http://www.flickr.com/photos/jsmjr/62443357/
  • 4. http://www.flickr.com/photos/camilleharrington/3587294608/
  • 5. http://www.flickr.com/photos/rkuhnau/3318245976/
  • 6. http://www.flickr.com/photos/conformpdx/1796399674/
  • 7. http://www.flickr.com/photos/rkuhnau/3317418699/
  • 8. http://www.flickr.com/photos/zemlinki/261617721/
  • 9. http://www.flickr.com/photos/tracenmatt/3020786491/
  • 10. http://www.flickr.com/photos/the-o/2078239333/
  • 11. Researchers have a choice http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 12. Researchers have a choice http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 13. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 14. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 15. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 16. Researchers have a choice PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 17. Shared data benefits science Verify Understand Extend Explore Combine Synergize Train Reduce
  • 18. But... costly for authors Find Organize Document Deidentify Format Decide Ask Submit Answer questions Worry about mistakes being found Worry about data being misinterpreted Worry about being scooped Forgo money and IP and prestige???
  • 19. http://www.flickr.com/photos/75166820@N00/5318468/
  • 20. As a result, policy makers have spent  lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  • 21. ... on initiatives, requests,  requirements, and tools • Funder requirements • Journal requirements • Public databases • Data sharing grids • Data formatting standards • Peer encouragement in editorials, letters to the editor...
  • 22. Does it work? http://www.flickr.com/photos/archeon/2941655917/
  • 23. lots of data sharing! http://www.genome.jp/en/db_growth.html
  • 24. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 25. http://www.flickr.com/photos/paulhami/1020538523//
  • 26. who what when where why how http://www.flickr.com/photos/ryanr/142455033/
  • 27. Who to share data with? • everyone on the internet • “qualified” researchers for “qualified” research projects • friends • your lab
  • 28. What data is shared? • everything • all the datapoints • all the research notes • code • just what is needed to reproduce the results in the paper • raw? cleaned? every processing step?
  • 29. When is the data shared? • upon collection • upon submission for publication • upon publication • time-embargo after publication • upon retirement or death
  • 30. Where is it deposited? • centralized datatype specific repositories • journal supplementary information • institutional repositories • disciplinary repositories
  • 31. Why share it?
  • 32. How to share it? • massive datasets • syntactic format • semantic format • sensitive data (privacy, endangered species locations, security- related, ...) • what license or community norm
  • 33. http://www.flickr.com/photos/paulhami/1020538523//
  • 34. http://www.flickr.com/photos/paulhami/1020538523//
  • 35. • biomedical data • few privacy concerns • raw data (not images or processed) • openly on the internet • upon publication • datasets are large but manageable • datatypes with mature standards for semantics, syntax, locations http://www.flickr.com/photos/paulhami/1020538523//
  • 36. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 37. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 38. Data sharing frequency depends  on how you ask 10% 25-40% Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
  • 39. Data sharing frequency depends  on datatype DNA sequences gene expression microarrays proteomics spectra 0% 25% 50% 75% 100% Noor et al. PLoS Biology 2006. Ochsner et al. Nature Methods 2008. Piwowar et al. PLoS ONE 2007. Editorial. Nature Biotech 2007.
  • 40. Data sharing frequency depends on when the data was published (chart: y-axis 0% to 40%; x-axis publication year, 2000 to 2009)
  • 41. lots of data sharing! http://www.genome.jp/en/db_growth.html
  • 42. Data sharing frequency depends on when the data was published (chart: y-axis 0% to 40%; x-axis publication year, 2000 to 2009)
  • 43. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 44. microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
  • 45. Funder Journal Investigator Institution Study How often was research data shared upon publication?
  • 46. How often was research data shared upon publication? Number of studies that share their data = _____________________________________ Number of studies that create data
  • 47. How often was research data shared upon publication? Number of studies that share their data = _____________________________________ Number of studies that create data
  • 48. http://www.flickr.com/photos/lofaesofa/248546821/
  • 49. Look for wetlab methods in full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
  • 50. Query the full text of published articles: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
  • 51. How often was research data shared upon publication? Number of studies that share their data = _____________________________________ Number of studies that create data
  • 52. Querying databases for citation links to data creation studies
  • 53. How often was research data shared upon publication? Number of studies that share their data = _____________________________________ Number of studies that create data
  • 54. results 11,603 studies that create data we found shared datasets for 25%
  • 55. Data sharing frequency depends on when the data was published (chart: y-axis 0% to 40%; x-axis publication year, 2000 to 2009)
  • 56. Funder Journal Investigator Institution Study
  • 57. Funder: funded by NIH?; size of grant; sharing plan req'd?; funded by non-NIH? • Journal: impact factor; strength of policy; open access?; number of microarray studies published • Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender • Institution: sector; size; impact rank; country • Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
  • 58. study type
  • 59. author “experience” Author publication history: Author-ity web service. Author name disambiguation: Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. Citation counts:
  • 60. author gender
  • 61. institution rank Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
  • 62. funding level PubMed grant lists + NIH grant details
  • 63. funder mandates Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
  • 64. journal mandates “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  • 65. and so on... 124 variables
  • 66. stats Univariate proportions Factor analysis Logistic regression Second-order factor analysis More logistic regression
  • 67. Proportion of datasets shared, by journal (chart: axis 0.0 to 1.0; journals in the order listed on the slide): Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun
  • 68. Proportion of datasets shared, by institution (chart: axis 0.0 to 1.0; institutions in the order listed on the slide): Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
  • 69. Proportion of datasets shared, by institution (chart: axis 0.0 to 1.0; institutions in the order listed on the slide): Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
  • 70. Proportion of datasets shared versus institution rank (chart: y-axis 0.0 to 1.0; x-axis institution rank, 1 to 1901)
  • 71. Multivariate nonlinear regressions with interactions (chart: odds ratio, axis 0.25 to 8.00, marked 0.95; factors as listed on the slide): Has journal policy; Count of R01 & other NIH grants; Authors prev GEOAE sharing & OA & microarray creation; NO K funding or P funding; Journal impact; Institution high citations & collaboration; Journal policy consequences & long halflife; NOT animals or mice; Institution is government & NOT higher ed; Last author num prev pubs & first year pub; Large NIH grant; Humans & cancer; NO geo reuse + YES high institution output; First author num prev pubs & first year pub
  • 72. Odds ratio (chart: axis 0.25 to 4.00, marked 0.95; factors as listed on the slide): OA journal & previous GEO-AE sharing; Amount of NIH funding; Journal impact factor and policy; Higher Ed in USA; Cancer & humans
  • 73. • association not causation • lots of assumptions • don’t know how generalizable it is • hypothesis-generating http://www.flickr.com/photos/vlastula/300102949/
  • 74. what isn’t shared? who isn’t sharing it? • those studying cancer • on human patient data • in journals with few data sharing policies (clinical journals) • labs with fewer funding sources • ...
  • 75. (what is shared? who is sharing it?) • investigators who have shared before • investigators who publish in open access journals • from Stanford • in Physiological Genomics • ...
  • 76. Take home • current data repositories are not representative of all data generated • they are missing some of the good stuff • Good news: actionable to learn from the leaders and focus on the laggards
  • 77. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 78. http://www.flickr.com/photos/jima/606588905/
  • 79. Withholding is associated with industry links, competitiveness (chart: industry involvement; perceived competitiveness of field; axis 0 to 3) 40% of surveyed scientists said data sharing was discouraged during their training! Blumenthal et al. Acad Med. 2006
  • 80. Withhold because too much effort,  desire for continued publishing  sharing is too much effort want student or jr faculty to publish more they themselves want to publish more cost industrial sponsor confidentiality commercial value of results 0% 20% 40% 60% 80% Campbell et al. JAMA 2002.
  • 81. Comments show desire for control `Before I send you the data could I ask what you want it for?' `Can you be more explicit, please, about the analyses you have in  mind and what you plan to do with them?' `We'll have to discuss your request with the other coauthors.   Before we do that, I'd like to know your proposed analysis plan.'  `We are not finished using the data, but when we are finished with  it, we would be open to requests for the data.' `Any use of the data other than for the specific purpose laid down  in the contract of collaboration is effectively ruled out.' Reidpath et al. Bioethics 2001.
  • 82. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 83. Estimating societal benefit ‐ assume each database hit saves $0.10, or a fraction of data collection costs ‐ assume the value is approximated by the (idealized) funding target for data maintenance: 20‐25% of the cost of generating the data Remembering, moreover, the indirect benefits are much higher than the direct ones. Ball et al. Nature Biotechnol. 2004.
  • 84. Number of stakeholders Foster et al. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nature Reviews Genetics 8, 633-639
  • 85. Impact on training Survey of doctoral students and postdocs: 23.0% been denied access to information, data, materials, or programming associated with published research 28-50% reported withholding caused negative effects on these aspects of their training: •progress of their research, •rate of discovery in their lab/research group, •quality of their relationships with academic scientists, •quality of their education, •level of communication in their lab/research group. Vogeli et al. Acad Med. 2006 Feb; 81(2):128-36
  • 86. More research needs to be done!
  • 87. but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
  • 88. Look to the leaders and laggards • Stanford • Physiological Genomics • cancer data • human data • those who haven’t shared before
  • 89. http://www.flickr.com/photos/sunrise/35819369/
  • 90. Measuring personal benefit:   increased citations Gleditsch et al. Int Studies Perspectives. 2003. Piwowar et al. PLoS ONE. 2007.
  • 91. 70% more citations Piwowar et al. PLoS ONE. 2007.
  • 92. What would make it easier?  help  and straightforward guidelines more funder time and money help with confidentiality issues on-site help more training better guidelines better tools simpler requirements less staff turn-over 0% 25% 50% 75% Hedstrom et al. IASSIST 2006.
  • 93. What would make it easier?  help  and straightforward guidelines more funder time and money help with confidentiality issues on-site help more training better guidelines better tools simpler requirements less staff turn-over 0% 25% 50% 75% Hedstrom et al. IASSIST 2006.
  • 94. What would make it easier?  help  and straightforward guidelines more funder time and money help with confidentiality issues on-site help more training better guidelines better tools simpler requirements less staff turn-over 0% 25% 50% 75% Hedstrom et al. IASSIST 2006.
  • 95. Incentives to share: perceived value,  mandates, recognition as publication if I thought it would really benefit others if required for future funding if required for publication if deposits counted as a publication if citations to data were valued if monetary compensation 0% 25% 50% 75% Hedstrom. Society of Am Archivists Ann Meeting. 2008.
  • 96. Incentives to share: perceived value,  mandates, recognition as publication if I thought it would really benefit others if required for future funding if required for publication if deposits counted as a publication if citations to data were valued if monetary compensation 0% 25% 50% 75% Hedstrom. Society of Am Archivists Ann Meeting. 2008.
  • 97. http://www.flickr.com/photos/gatewaystreets/3838452287/
  • 98. • #oa.data • Science Commons • DataCite • Dataverse • MGED • Open Notebook Science • Friendfeed • Nature editorials • many others...
  • 99. NSF-funded distributed framework and cyberinfrastructure for environmental science. Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields. The National Evolutionary Synthesis Center, NSF-funded: • Duke University, • UNC at Chapel Hill • North Carolina State University
  • 100. http://www.flickr.com/photos/g_kat26/4255119413/
  • 101. Begin to investigate reuse http://www.flickr.com/photos/boitabulle/3668162701/
  • 102. who reuses data? why? when? who doesn’t? which datasets are most likely to be  reused? how many datasets could be reused but  aren’t? why aren’t they? what can we do about it? what should we do about it?
  • 103. I share my code and data at http://www.researchremix.org Sharing data is not easy. Some is better than none. Be the change you want to see. http://www.flickr.com/photos/myklroventine/892446624/
  • 104. Dept of Biomedical Informatics at U of Pittsburgh NLM for training grant funding Open science online community and those who release their articles, datasets and photos openly NEDCC thank you
  • 105. Once shared, always there?
  • 106. Data contacts and storage decay with time URL decay: email decay: Supplementary information: in 6 top journals: 5% unavailable after 2 years, 10% unavailable after 5 years Evangelou et al. FASEB J. 2006. Wren. Bioinformatics 2008. Wren et al. EMBO Rep 2006.
  • 107. Benefits both societal and personal saves other people effort for the public good will be cited and enhance my reputation saves me effort in answering questions saves me effort in managing my data 0% 20% 40% 60% 80% Hedstrom et al. IASSIST 2006.
  • 108. http://www.flickr.com/photos/sunrise/35819369/ http://www.flickr.com/photos/fboyd/2156630044/
  • 109. http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/
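
The sharing-rate ratio on slides 46 to 53 (number of studies that share their data divided by number of studies that create data) and the odds ratios on slides 71 and 72 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Study` record, its field names, and the nine toy rows are hypothetical and are not the talk's dataset. The odds ratio here is a plain unadjusted 2x2 calculation, not the multivariate regression with interactions named on slide 71.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Study:
    """One published study. All fields are hypothetical illustration fields."""
    study_id: str
    creates_data: bool        # did the study generate a dataset?
    shared_dataset: bool      # was a shared dataset found for it?
    journal_has_policy: bool  # does the journal have a data sharing policy?


def sharing_proportion(studies: List[Study]) -> float:
    """Studies that share their data / studies that create data."""
    creators = [s for s in studies if s.creates_data]
    if not creators:
        raise ValueError("no data-creating studies in the sample")
    shared = sum(1 for s in creators if s.shared_dataset)
    return shared / len(creators)


def unadjusted_odds_ratio(studies: List[Study]) -> float:
    """Odds of sharing with a journal policy vs. without, from a 2x2 table."""
    creators = [s for s in studies if s.creates_data]
    a = sum(1 for s in creators if s.journal_has_policy and s.shared_dataset)
    b = sum(1 for s in creators if s.journal_has_policy and not s.shared_dataset)
    c = sum(1 for s in creators if not s.journal_has_policy and s.shared_dataset)
    d = sum(1 for s in creators
            if not s.journal_has_policy and not s.shared_dataset)
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a denominator cell is zero")
    return (a * d) / (b * c)


def toy_studies() -> List[Study]:
    """Nine made-up rows: eight create data, one does not."""
    rows = [
        ("s1", True, True, True),
        ("s2", True, True, True),
        ("s3", True, True, True),
        ("s4", True, False, True),
        ("s5", True, True, False),
        ("s6", True, False, False),
        ("s7", True, False, False),
        ("s8", True, False, False),
        ("s9", False, False, True),  # excluded: creates no data
    ]
    return [Study(*row) for row in rows]


if __name__ == "__main__":
    sample = toy_studies()
    print("proportion shared:", sharing_proportion(sample))
    print("unadjusted odds ratio:", unadjusted_odds_ratio(sample))
```

An odds ratio above 1.00 means sharing is more common in the group with the factor, and one below 1.00 means it is less common. As slide 73 cautions, such figures show association, not causation.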