Research into Open Research Data


Presentation by Heather Piwowar as part of UBC's Open Access Week 2010

Published in: Technology, Sports, Education

  1. 1. Open research data Heather Piwowar DataONE postdoc with Dryad and NESCent, UBC @researchremix OA week 2010 University of British Columbia
  2. 2. #1 It matters
  3. 3. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
  4. 4. http://www.flickr.com/photos/jsmjr/62443357/
  5. 5. http://www.flickr.com/photos/camilleharrington/3587294608/
  6. 6. http://www.flickr.com/photos/rkuhnau/3318245976/
  7. 7. http://www.flickr.com/photos/conformpdx/1796399674/
  8. 8. http://www.flickr.com/photos/rkuhnau/3317418699/
  9. 9. http://www.flickr.com/photos/zemlinki/261617721/
  10. 10. http://www.flickr.com/photos/tracenmatt/3020786491/
  11. 11. http://www.flickr.com/photos/the-o/2078239333/
  12. 12. http://www.flickr.com/photos/75166820@N00/5318468/
  13. 13. #2 Wayfinding + progress
  14. 14. http://www.flickr.com/photos/paulhami/1020538523//
  15. 15. Which data? http://www.flickr.com/photos/paulhami/1020538523//
  16. 16. Where? http://www.flickr.com/photos/paulhami/1020538523//
  17. 17. With whom? http://www.flickr.com/photos/paulhami/1020538523//
  18. 18. When? http://www.flickr.com/photos/paulhami/1020538523//
  19. 19. Under what terms? http://www.flickr.com/photos/paulhami/1020538523//
  20. 20. http://www.flickr.com/photos/paulhami/1020538523//
  21. 21. Find. Organize. Document. Deidentify. Format. Ask. Submit. Answer questions. Worry about mistakes being found. Worry about data being misinterpreted. Worry about being scooped. Forgo money and IP and prestige???
  22. 22. not very motivating.
  23. 23. http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  24. 24. a) policies + expectations - NSF - Joint Data Archiving Policy - BioMed Central - PLoS
  25. 25. b) repositories - datatype-based - institution-based - discipline-based - journal-based
  26. 26. c) standards - data licenses - data citation - IDs for datasets, people, entities
  27. 27. d) part of something bigger - open government data - citizen science - supplemental materials - dataset-based usage metrics - awards, recognition
  28. 28. #3 Is it working?
  29. 29. lots of data sharing! http://www.genome.jp/en/db_growth.html
  30. 30. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  31. 31. you can not manage what you do not measure quote: Lord Kelvin http://www.flickr.com/photos/archeon/2941655917/
  32. 32. http://www.flickr.com/photos/ryanr/142455033/
  33. 33. Why is it important? Are we sure?
  34. 34. Errors. More than half of all papers contain errors; 5-10% contain errors that change the conclusions. Gore et al. 1977, Kanter and Taylor 1994, McGuigan 1995, Hurlbert and White 1993
  35. 35. Ok, let’s share on request.
  36. 36. Doesn’t work. [Bar chart, 0% to 40%: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data] Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
  37. 37. Don’t get the email. Evangelou et al. FASEB J. 2006. Wren. Bioinformatics 2008. Wren et al. EMBO Rep 2006.
  38. 38. Say no. [Bar chart, 0% to 50%: want to publish more papers first; want exclusive use; ensure data confidentiality; control; avoid cost of preparation] Hedstrom. Society of Am Archivists Ann Meeting. 2008.
  39. 39. Ask why: “Before I send you the data could I ask what you want it for?” “Can you be more explicit, please, about the analyses you have in mind and what you plan to do with them?” “We'll have to discuss your request with the other coauthors. Before we do that, I'd like to know your proposed analysis plan.” “We are not finished using the data, but when we are finished with it, we would be open to requests for the data.” “Any use of the data other than for the specific purpose laid down in the contract of collaboration is effectively ruled out.” Reidpath et al. Bioethics 2001.
  40. 40. Not efficient.
  41. 41. Not efficient. Not fair. Not random: young, productive. Campbell et al. 2000
  42. 42. Has real costs. Survey of doctoral students and postdocs: 28-50% reported withholding negative effects: • hurt progress of their research, • hurt rate of discovery in their lab/research group, • hurt quality of their relationships with academic scientists, • hurt quality of their education, • hurt level of communication in their lab/research group. Vogeli et al. Acad Med. 2006 Feb; 81(2):128-36
  43. 43. Ok, then on a website? No. URLs stop working. Evangelou et al. FASEB J. 2006. Wren. Bioinformatics 2008. Wren et al. EMBO Rep 2006.
  44. 44. Ok, in a repository?
  45. 45. lots of data sharing! http://www.genome.jp/en/db_growth.html
  46. 46. http://www.flickr.com/photos/g_kat26/4255119413/
  47. 47. http://www.flickr.com/photos/jima/606588905/
  48. 48. Combined, these full-text portals reach 85% of the articles available through U of Pittsburgh library subscriptions.
  49. 49. microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
  50. 50. 11,603 studies that created gene expression microarray data
  51. 51. Funder Journal Investigator Institution Study Is research data shared after publication?
  52. 52. Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH? Journal: impact factor; strength of policy; open access?; number of microarray studies published. Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender. Institution: sector; size; impact rank; country. Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year.
  53. 53. journal data sharing policy “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  54. 54. journal rank
  55. 55. institution rank Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
  56. 56. funding level PubMed grant lists + NIH grant details
  57. 57. study type
  58. 58. author gender
  59. 59. and so on... 124 variables
  60. 60. 11,603 studies 25% had links from datasets in databases
  61. 61. Across time. [Figure: proportion of articles with shared datasets, by year. Vertical axis: proportion of articles with datasets found in GEO or ArrayExpress (0.05 to 0.35). Horizontal axis: year article published (2000 to 2009).]
  62. 62. What can we do about it?
  63. 63. What can we do about it? Funder policies.
  64. 64. 19% Piwowar and Chapman. Journal of Informetrics 2010
  65. 65. What can we do about it? Journal policies.
  66. 66. We looked at data sharing policies within Instruction to Author statements of 70 journals, as they apply to gene expression microarray data. Piwowar and Chapman. ELPUB 2008
  67. 67. Strength of data sharing policies. No applicable policy (43%). Weak policy (24%): “should”, “recommend”, “request”; or “must”, but without requiring a database accession number. Strong policy (33%): “must”, “required”, “condition of publication”; requires a database accession number.
  68. 68. High-impact journals tend to have a strong data-sharing policy
  69. 69. Articles published in journals with a strong data-sharing policy are more likely to have publicly available datasets
  70. 70. What can we do about it? Learn • Learn from those who do it well • Focus on places that need it
  71. 71. [Figure: proportion of datasets shared (0.0 to 1.0), by journal; highlighted: Physiological Genomics.] Journals: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun
  72. 72. [Figure: proportion of datasets shared (0.0 to 1.0), by institution; highlighted: Stanford.] Institutions: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
  73. 73. [Figure: proportion of datasets shared (0.0 to 1.0) against institution rank (1 to 1901)]
  74. 74. Multivariate nonlinear regressions with interactions. [Figure: odds ratios (axis 0.25 to 8.00; plot label 0.95) for: Has journal policy; Count of R01 & other NIH grants; Authors prev GEOAE sharing & OA & microarray creation; NO K funding or P funding; Journal impact; Journal policy consequences & long halflife; Institution high citations & collaboration; NOT animals or mice; Institution is government & NOT higher ed; Last author num prev pubs & first year pub; Large NIH grant; Humans & cancer; NO geo reuse + YES high institution output; First author num prev pubs & first year pub]
  75. 75. Multivariate nonlinear regressions with interactions. [Figure: odds ratios (axis 0.25 to 8.00; plot label 0.95) for: Has journal policy; Count of R01 & other NIH grants; Authors prev GEOAE sharing & OA & microarray creation; NO K funding or P funding; Journal impact; Journal policy consequences & long halflife; Institution high citations & collaboration; NOT animals or mice; Institution is government & NOT higher ed; Last author num prev pubs & first year pub; Large NIH grant; Humans & cancer; NO geo reuse + YES high institution output; First author num prev pubs & first year pub]
  76. 76. Multivariate nonlinear regression with interactions. [Figure: odds ratios (axis 0.25 to 4.00; plot label 0.95) for: OA journal & previous GEO-AE sharing; Amount of NIH funding; Journal impact factor and policy; Higher Ed in USA; Cancer & humans]
  77. 77. Multivariate nonlinear regression with interactions. [Figure: odds ratios (axis 0.25 to 4.00; plot label 0.95) for: OA journal & previous GEO-AE sharing; Amount of NIH funding; Journal impact factor and policy; Higher Ed in USA; Cancer & humans]
  78. 78. Carrot? http://www.flickr.com/photos/sunrise/35819369/
  79. 79. currency of value? Citations.
  80. 80. currency of value? Citations. $50! Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
  81. 81. Dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). Citations: ISI Web of Science Citation index, citations from 2004-2005. Data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. Statistics: multivariate linear regression.
  82. 82. Note: log scale
  83. 83. ~70%
  84. 84. Next? http://www.flickr.com/photos/gatewaystreets/3838452287/
  85. 85. Abadie et al. Journal of the American Statistical Association 2010
  86. 86. http://www.flickr.com/photos/boitabulle/3668162701/
  87. 87. http://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Gamma_distribution_pdf.svg/500px-Gamma_distribution_pdf.svg.png
  88. 88. #4 We are the culture. Let’s do it.
  89. 89. http://www.flickr.com/photos/joellevand/279468607/
  90. 90. http://www.flickr.com/photos/huzzahvintage/4577075021/
  91. 91. a) in our communities - strengthening policies: - journal, conference, institutional - decision-makers - role-models and educators
  92. 92. b) in our tools - measure opinions - measure use - be transparent!
  93. 93. c) with our data - share it. - ugly? incomplete? strange? “Flawed, but out there” is a million times better than “perfect, but unattainable” http://sciblogs.co.nz/seeing-data/2010/10/12/the-zen-of-open-data/
  94. 94. “Does anyone want your data? That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.” Got data? Nature Neuroscience (2007)
  95. 95. I post my data, code, and statistical scripts: http://researchremix.org Share yours too! http://www.flickr.com/photos/myklroventine/892446624/
  96. 96. More info? • OATP oa.data tag on Connotea, Twitter • FriendFeed • Mendeley “data sharing” group • @researchremix piwowar@zoology.ubc.ca
  97. 97. thank you Todd Vision, Michael Whitlock, Wendy Chapman The open science online community and those who release their articles, datasets and photos openly
  98. 98. http://www.flickr.com/photos/youraddresshere/6649228/
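
The policy-strength categories on slide 67 can be sketched as a small keyword rule. This is only an illustration under stated assumptions, not the coding procedure used in the study: the keyword lists come from the slide, while the function name `classify_policy`, the constants, the matching logic, and the idea of detecting an accession-number requirement by searching for the phrase "accession number" are all hypothetical. The input is assumed to be the data-sharing passage of a journal's author instructions.

```python
import re

# Keyword lists as they appear on slide 67; everything else here is an assumption.
WEAK_TERMS = ("should", "recommend", "request")
STRONG_TERMS = ("must", "required", "condition of publication")


def _mentions_any(text: str, terms: tuple) -> bool:
    """True if any term starts at a word boundary in the lowercased text."""
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def classify_policy(policy_text: str) -> str:
    """Label a data-sharing policy excerpt as 'none', 'weak', or 'strong'.

    Strong: mandatory wording plus an accession-number requirement.
    Weak: advisory wording, or mandatory wording without an accession number.
    None: no matching wording at all.
    """
    lowered = policy_text.lower()
    mandatory = _mentions_any(lowered, STRONG_TERMS)
    advisory = _mentions_any(lowered, WEAK_TERMS)
    # Hypothetical shortcut: treat any mention of an accession number as a requirement.
    needs_accession = "accession number" in lowered

    if mandatory and needs_accession:
        return "strong"
    if mandatory or advisory:
        return "weak"
    return "none"
```

A real coding exercise would need a human reader to confirm each label, since a phrase such as "accession number" can appear in a policy without being required.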
