"Leaders and Laggards in the preservation of raw biomedical research data" presented at NEDCC 2010, The Tectonics of Digital Curation
A Symposium on the Shifting Preservation and Access Landscape
1. Leaders and Laggards
in the preservation of
raw biomedical research data
Heather Piwowar
Department of Biomedical Informatics
University of Pittsburgh
Soon‐to‐be Postdoctoral Associate with
Data Observation Network for Earth (DataONE)
13. Researchers have a choice
PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had
non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg
http://en.wikipedia.org/wiki/Image:Helices.png
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://en.wikipedia.org/wiki/Image:Microarray2.gif
http://zellig.cpmc.columbia.edu/medlee/demo/
http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
18. But... costly for authors
Find
Organize
Document
Deidentify
Format
Decide
Ask
Submit
Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
21. ... on initiatives, requests,
requirements, and tools
• Funder requirements
• Journal requirements
• Public databases
• Data sharing grids
• Data formatting standards
• Peer encouragement in editorials, letters to the
editor...
22. Does it work?
http://www.flickr.com/photos/archeon/2941655917/
27. Who to share data with?
• everyone on the internet
• “qualified” researchers for
“qualified” research projects
• friends
• your lab
28. What data is shared?
• everything
• all the datapoints
• all the research notes
• code
• just what is needed to reproduce
the results in the paper
• raw? cleaned?
every processing step?
29. When is the data shared?
• upon collection
• upon submission for publication
• upon publication
• time-embargo after publication
• upon retirement or death
30. Where is it deposited?
• centralized datatype specific
repositories
• journal supplementary information
• institutional repositories
• disciplinary repositories
32. How to share it?
• massive datasets
• syntactic format
• semantic format
• sensitive data (privacy, endangered
species locations, security-
related, ...)
• what license or community norm
35. • biomedical data
• few privacy concerns
• raw data (not images or processed)
• openly on the internet
• upon publication
• datasets are large but manageable
• datatypes with mature standards for
semantics, syntax, locations
http://www.flickr.com/photos/paulhami/1020538523//
36. but how much isn’t
shared?
what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?
38. Data sharing frequency depends
on how you ask
10%
25-40%
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.
39. Data sharing frequency depends
on datatype
DNA sequences
gene expression microarrays
proteomics spectra
0% 25% 50% 75% 100%
Noor et al. PLoS Biology 2006.
Ochsner et al. Nature Methods 2008.
Piwowar et al. PLoS ONE 2007.
Editorial. Nature Biotech 2007.
43. but how much isn’t
shared?
what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?
44. microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
45. Funder Journal Investigator Institution Study
How often was
research data shared upon
publication?
46. How often was
research data shared upon
publication?
= (Number of studies that share their data) / (Number of studies that create data)
49. Look for wetlab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
50. Query the full text of published articles:
("gene expression" AND microarray AND cell AND rna)
AND (rneasy OR trizol OR "real-time pcr")
NOT ("tissue microarray*" OR "cpg island*")
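A minimal sketch of how the query on this slide could be assembled from term lists, which makes it easier to adjust the filter. The helper names `quote` and `build_query` and the three list names are illustrative assumptions, not part of the talk, and the sketch only builds the query string; it does not call any search service.

```python
def quote(term: str) -> str:
    """Wrap multi-word terms in straight double quotes; leave single words bare."""
    return f'"{term}"' if " " in term else term


def build_query(required, any_of, excluded):
    """Combine three term lists into one boolean full-text query string."""
    must = " AND ".join(quote(t) for t in required)
    alternatives = " OR ".join(quote(t) for t in any_of)
    skip = " OR ".join(quote(t) for t in excluded)
    return f"({must}) AND ({alternatives}) NOT ({skip})"


# Terms taken from the slide; the grouping into lists is an assumption.
REQUIRED = ["gene expression", "microarray", "cell", "rna"]
ANY_OF = ["rneasy", "trizol", "real-time pcr"]
EXCLUDED = ["tissue microarray*", "cpg island*"]

query = build_query(REQUIRED, ANY_OF, EXCLUDED)
print(query)
```

Keeping the wet-lab reagent terms in their own list means new reagents can be added without touching the rest of the filter.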
51. How often was
research data shared upon
publication?
= (Number of studies that share their data) / (Number of studies that create data)
55. results
11,603 studies that create data
we found shared datasets for 25%
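The ratio behind this result can be sketched in a few lines. The function name `sharing_rate` is an assumption for illustration. The slide reports only the denominator (11,603) and a rounded rate (25%), so the number of sharing studies below is back-calculated and approximate, not a figure from the talk.

```python
def sharing_rate(n_shared: int, n_created: int) -> float:
    """Fraction of data-creating studies for which a shared dataset was found."""
    if n_created <= 0:
        raise ValueError("n_created must be positive")
    return n_shared / n_created


N_CREATED = 11_603      # from the slide
REPORTED_RATE = 0.25    # from the slide (rounded)

# Assumption: back-calculate an approximate count from the rounded rate.
approx_shared = round(REPORTED_RATE * N_CREATED)

print(approx_shared)
print(f"{sharing_rate(approx_shared, N_CREATED):.1%}")
```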
58. Funder, Journal, Investigator, Institution, Study
• Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
• Journal: impact factor; strength of policy; open access?; number of microarray studies published
• Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender
• Institution: sector; size; impact rank; country
• Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
60. author “experience”
Author publication history: Author-ity web service (author name disambiguation)
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on
Knowledge Discovery from Data, 3(3):11.
Citation counts:
64. funder mandates
Requires a data sharing plan
for studies funded after October 2003
that receive more than $500 000 in
direct funding per year
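The rule on this slide can be read as a simple predicate. The sketch below is one possible reading, not the funder's official logic: the function name `plan_required` is hypothetical, and the exact cutoff day (taken here as 1 October 2003) is an assumption, since the slide says only "after October 2003".

```python
from datetime import date

# Assumption: the slide gives only the month, so the exact day is a guess.
ASSUMED_CUTOFF = date(2003, 10, 1)
DIRECT_COST_THRESHOLD = 500_000  # dollars per year, from the slide


def plan_required(funding_start: date, annual_direct_costs: float,
                  cutoff: date = ASSUMED_CUTOFF) -> bool:
    """Return True when a study would need a data sharing plan under this reading."""
    funded_late_enough = funding_start >= cutoff
    large_enough = annual_direct_costs > DIRECT_COST_THRESHOLD
    return funded_late_enough and large_enough


# Hypothetical studies, for illustration only.
print(plan_required(date(2005, 1, 1), 750_000))
print(plan_required(date(2002, 6, 1), 750_000))
print(plan_required(date(2005, 1, 1), 500_000))
```

Note that "more than $500 000" is treated as a strict inequality, so a study at exactly the threshold is not covered.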
65. journal mandates
“An inherent principle of publication is that
others should be able to replicate and build
upon the authors' published claims.
Therefore, a condition of publication
in a Nature journal is that authors are
required to make materials, data and
associated protocols available in a publicly
accessible database …”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
68. Proportion of datasets shared
[Bar chart: proportion of datasets shared (axis 0.0 to 1.0), by journal]
Journals shown: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics;
Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet;
Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol;
Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci;
J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics;
Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res;
J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci;
Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett;
Biochem Biophys Res Commun
69. Proportion of datasets shared
[Bar chart: proportion of datasets shared (axis 0.0 to 1.0), by institution]
Institutions shown: Stanford University; University of Pennsylvania; University of Illinois;
University of California, Los Angeles; University of Wisconsin, Madison; University of Washington;
University of California, Davis; The University of British Columbia;
University of California, San Francisco; University of Florida; University of California, San Diego;
University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft;
Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University;
University of Pittsburgh; Washington University in Saint Louis; University of Toronto;
University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University;
National Cancer Institute; Tokyo Daigaku
72. Multivariate nonlinear regressions with interactions
[Forest plot: odds ratios, axis ticks 0.25, 0.50, 1.00, 2.00, 4.00, 8.00; 0.95 marked on the plot]
• Has journal policy
• Count of R01 & other NIH grants
• Authors prev GEOAE sharing & OA & microarray creation
• NO K funding or P funding
• Journal impact
• Institution high citations & collaboration
• Journal policy consequences & long halflife
• Institution is government & NOT higher ed
• NOT animals or mice
• Last author num prev pubs & first year pub
• Large NIH grant
• Humans & cancer
• NO geo reuse + YES high institution output
• First author num prev pubs & first year pub
73. Odds Ratio
[Forest plot: odds ratios, axis ticks 0.25, 0.50, 1.00, 2.00, 4.00; 0.95 marked on the plot]
• OA journal & previous GEO-AE sharing
• Amount of NIH funding
• Journal impact factor and policy
• Higher Ed in USA
• Cancer & humans
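For readers less familiar with the measure on these plots, here is a minimal sketch of an unadjusted odds ratio computed from a 2x2 table. This is a deliberately simpler technique than the multivariate regressions in the talk, and the counts are invented purely for illustration. The function name `odds_ratio` and its parameter names are assumptions.

```python
def odds_ratio(shared_with: int, not_shared_with: int,
               shared_without: int, not_shared_without: int) -> float:
    """Unadjusted odds ratio of sharing for studies with vs. without a factor."""
    if 0 in (not_shared_with, shared_without, not_shared_without):
        raise ValueError("zero cell: the odds ratio is undefined or infinite")
    odds_with = shared_with / not_shared_with
    odds_without = shared_without / not_shared_without
    return odds_with / odds_without


# Hypothetical counts, NOT from the talk:
# with the factor: 40 studies shared, 60 did not
# without the factor: 20 studies shared, 80 did not
example = odds_ratio(40, 60, 20, 80)
print(round(example, 2))
```

A value above 1.00 means sharing is more likely when the factor is present, and a value below 1.00 means it is less likely, which is how the axis on the slides reads.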
74. • association not causation
• lots of assumptions
• don’t know how generalizable it is
• hypothesis-generating
http://www.flickr.com/photos/vlastula/300102949/
75. what isn’t shared?
who isn’t sharing it?
• those studying cancer
• on human patient data
• in journals with few data sharing policies
(clinical journals)
• labs with fewer funding sources
• ...
76. (what is shared?
who is sharing it?)
• investigators who have shared before
• investigators who publish in open access journals
• from Stanford
• in Physiological Genomics
• ...
77. Take home
• current data repositories are not representative
of all data generated
• they are missing some of the good stuff
• Good news: actionable to learn from the leaders
and focus on the laggards
78. but how much isn’t
shared?
what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?
81. Withhold because too much effort,
desire for continued publishing
sharing is too much effort
want student or jr faculty to publish more
they themselves want to publish more
cost
industrial sponsor
confidentiality
commercial value of results
0% 20% 40% 60% 80%
Campbell et al. JAMA 2002.
83. but how much isn’t
shared?
what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?
84. Estimating societal benefit
‐ assume each database hit saves $0.10, or a
fraction of data collection costs
‐ assume the value is approximated by the
(idealized) funding target for data
maintenance:
20‐25% of the cost of generating the data
Note, moreover, that the indirect benefits are much
higher than the direct ones.
Ball et al. Nature Biotechnol. 2004.
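The two assumptions on this slide translate into simple arithmetic. The sketch below uses the $0.10 per hit and the 20 to 25% range from the slide; the function names and the example inputs (one million hits, a $2,000,000 dataset) are hypothetical and chosen only to show the calculation.

```python
SAVINGS_PER_HIT = 0.10             # dollars per database hit, from the slide
MAINTENANCE_SHARE = (0.20, 0.25)   # share of data generation cost, from the slide


def value_from_hits(n_hits: int) -> float:
    """Estimated benefit in dollars if each database hit saves a fixed amount."""
    return n_hits * SAVINGS_PER_HIT


def value_from_generation_cost(generation_cost: float) -> tuple:
    """Low and high benefit estimates as a share of the cost of generating the data."""
    low, high = MAINTENANCE_SHARE
    return (generation_cost * low, generation_cost * high)


# Hypothetical inputs, for illustration only.
hits_estimate = value_from_hits(1_000_000)
low_estimate, high_estimate = value_from_generation_cost(2_000_000)

print(f"${hits_estimate:,.0f}")
print(f"${low_estimate:,.0f} to ${high_estimate:,.0f}")
```

Both functions estimate direct benefit only; the slide's point that indirect benefits are much higher is not modeled here.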
85. Number of stakeholders
Foster et al. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data.
Nature Reviews Genetics 8, 633-639
86. Impact on training
Survey of doctoral students and postdocs:
23.0% had been denied access to information, data,
materials, or programming associated with published
research
28-50% reported withholding caused negative effects on
these aspects of their training:
•progress of their research,
•rate of discovery in their lab/research group,
•quality of their relationships with academic scientists,
•quality of their education,
•level of communication in their lab/research group.
Vogeli et al. Acad Med. 2006 Feb; 81(2):128-36
93. What would make it easier? help
and straightforward guidelines
more funder time and money
help with confidentiality issues
on-site help
more training
better guidelines
better tools
simpler requirements
less staff turn-over
0% 25% 50% 75%
Hedstrom et al. IASSIST 2006.
100. NSF-funded distributed framework
and cyberinfrastructure for
environmental science.
Dryad is a repository of data
underlying scientific publications,
with an initial focus on evolution,
ecology, and related fields.
The National Evolutionary
Synthesis Center, NSF-funded:
• Duke University,
• UNC at Chapel Hill
• North Carolina State University
103. who reuses data?
why?
when?
who doesn’t?
which datasets are most likely to be
reused?
how many datasets could be reused but
aren’t?
why aren’t they?
what can we do about it?
what should we do about it?
104. I share my code and data at http://www.researchremix.org
Sharing data is not easy.
Some is better than none.
Be the change you want to see.
http://www.flickr.com/photos/myklroventine/892446624/
105. Dept of Biomedical Informatics at U of Pittsburgh
NLM for training grant funding
Open science online community and those who release their
articles, datasets and photos openly
NEDCC
thank you
109. Benefits both societal and personal
saves other people effort
for the public good
will be cited and enhance my reputation
saves me effort in answering questions
saves me effort in managing my data
0% 20% 40% 60% 80%
Hedstrom et al. IASSIST 2006.