Thesis Proposal Piwowar Presentation 20091109

Foundational studies for measuring the
impact, prevalence, and patterns
of publicly sharing
biomedical research data

Heather Piwowar
Department of Biomedical Informatics
University of Pittsburgh

Sharing research data

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg
http://en.wikipedia.org/wiki/Image:Helices.png
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://en.wikipedia.org/wiki/Image:Microarray2.gif
http://zellig.cpmc.columbia.edu/medlee/demo/
http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Sharing research data

PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had
non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …

http://www.flickr.com/photos/75166820@N00/5318468/

Shared data benefits science
Verify
Understand
Extend
Explore
Combine
Synergize
Train
Reduce

But... costly for authors
Find
Organize
Document
Deidentify
Format
Decide
Ask
Submit

Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???

As a result, policy makers have spent
lots of time and money ....

http://www.flickr.com/photos/johnnyvulkan/381941233/
http://www.flickr.com/photos/tonivc/2283676770/

... on initiatives, requests,
requirements, and tools
Funder data sharing requirements

Journal requirements and requests

Databases

Data sharing collaboration grids

Standards

Editorials, letters to the editor, discussion....

http://www.flickr.com/photos/mesh/14102209/

lots of data sharing!

http://www.genome.jp/en/db_growth.html

but how much isn’t
shared?

what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?

you can not manage
what you do not measure

http://www.flickr.com/photos/archeon/2941655917/


Related research
Data usually collected via surveys
and/or manual audits

http://www.flickr.com/photos/jima/606588905/

Models of data and knowledge
sharing

Andriessen. Conditions for the willingness to share knowledge, 2006.

Cabrera and Cabrera. Int J of HR Mgmt. 2005.

Limitations of the related research
• manual audits: small sample sizes

• surveys: few variables + self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on rewards
• not much focus on policy
• not much focus on biomedical data other than
DNA sequences

Needed:
a study of data sharing behavior and impact
that includes

• a measurement of demonstrated behavior
• policy variables
• estimate of rewards
• a broad and deep selection of data creation instances

Aim 1: Does sharing have benefit for
those who share?

Aim 2: Can sharing and withholding be
systematically measured?

Aim 3: How often is data shared?
What predicts sharing?
How can we model sharing behavior?

Scope of proposed study

studies
Published studies with English full text available in
a centralized portal

variables for examination
extracted from Medline and other sources

Microarray data

http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG

http://farm3.static.flickr.com/2146/2389590651_9bbcc9d07e.jpg

Aim 1: Does sharing have benefit
for those who share?

http://www.flickr.com/photos/sunrise/35819369/


Benefit of value: Citations.

dataset
85 cancer microarray trials published in 1999-2003,
as identified by Ntzani and Ioannidis (2003)

citations
ISI Web of Science Citation index, citations from
2004-2005

data sharing locations
Publisher and lab websites, microarray databases,
WayBack Internet Archive, Oncomine

statistics
Multivariate linear regression


[Figure: note the logarithmic scale]

In multivariate regression, we found studies that had
made their data publicly available received 69% more
citations than similar studies that did not share their
data (95% confidence interval: 18% to 143%).

Piwowar, Day and Fridsma (2007) Sharing Detailed
Research Data Is Associated with Increased Citation
Rate. PLoS ONE 2(3): e308
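
A percent effect with an asymmetric interval, like the 69% (18% to 143%) reported here, is what one would expect when the regression is fit to log-transformed citation counts. The sketch below assumes such a log-linear model (an assumption, suggested by the slide's note about a logarithmic scale) and only re-expresses the reported numbers. The helper names are hypothetical and nothing here is refit from the study data.

```python
import math


def percent_to_log_coef(percent: float) -> float:
    """Percent change in citations -> coefficient on log(citations)."""
    return math.log(1.0 + percent / 100.0)


def log_coef_to_percent(beta: float) -> float:
    """Coefficient on log(citations) -> percent change in citations."""
    return (math.exp(beta) - 1.0) * 100.0


# Values reported on the slide (Piwowar, Day and Fridsma, 2007).
estimate_pct, ci_low_pct, ci_high_pct = 69.0, 18.0, 143.0

# Assumption: the model was linear in log(citations), so these are the
# corresponding log-scale quantities.
beta = percent_to_log_coef(estimate_pct)
beta_low = percent_to_log_coef(ci_low_pct)
beta_high = percent_to_log_coef(ci_high_pct)

# On the log scale the interval is close to symmetric around the estimate,
# as a normal-theory interval on a log-scale coefficient would be.
midpoint = (beta_low + beta_high) / 2.0
print(f"log-scale estimate: {beta:.3f}, interval midpoint: {midpoint:.3f}")
print(f"back-transformed estimate: {log_coef_to_percent(beta):.0f}%")
```

The interval looks lopsided in percent terms only because exponentiating stretches the upper end more than the lower end.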

Aim 1 conclusion: data sharing has a
benefit for sharers

Next: What factors predict sharing?

http://www.flickr.com/photos/ryanr/142455033/


Can I use the same methods of Aim 1
to choose studies and determine data sharing status?




No, those methods don't scale to identify or classify
enough data points.


Need automated methods to:

Identify studies that generate datasets that
could potentially be shared (Aim 2a)

Determine which of these have in fact been
shared (Aim 2b)

Aim 2a: Identify studies that create
gene expression microarray data

http://www.flickr.com/photos/lofaesofa/248546821/

Easy, via MeSH indexing terms?

gene expression profiling and/or
microarray analysis

Unfortunately, this query has neither high recall nor high precision.

Instead, look for wet-lab methods in the full text:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745

And query the full text through full-text query portals:

query development
Use supervised natural language processing
techniques on a corpus of Open Access articles

query evaluation
400 studies that created gene expression microarray
data, as identified by Ochsner et al (2008)

goal
>90% precision, and sufficient recall to retrieve >1250
articles
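
The precision and recall targets above can be checked against a reference standard with a few set operations. This is a minimal sketch: the PubMed IDs are made-up placeholders, and `precision_recall` is a hypothetical helper, not part of the proposed pipeline.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) of a retrieved set against a reference set.

    precision = true positives / everything the query retrieved
    recall    = true positives / everything in the reference standard
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: what a full-text query returned versus which studies
# the reference standard says created microarray data.
retrieved_ids = {234, 345, 456, 789}
reference_ids = {234, 345, 456, 567, 678}

precision, recall = precision_recall(retrieved_ids, reference_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A query tuned this way can trade recall for precision, which matches the stated goal: keep precision above 90% and accept whatever recall still yields enough articles.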

Aim 2b: Identify studies that share
their expression microarray data

http://www.flickr.com/photos/dcassaa/422261773/


pmc_gds[filter]

+ text processing on
ArrayExpress website

Enough? Unbiased?

reference standard
200 of the 400 studies that created gene expression
microarray data have shared their microarray data, as
identified by Ochsner et al (2008)

goal
Establish that filter has >70% recall with an unbiased
representation of MeSH terms, dataset size, and
dataset species

Aim 3: How often is data shared?
What predicts sharing?
How can we model sharing behavior?


Aim 3a: Prevalence of data sharing


PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
789         HighPr   No
890         PMC      No
901         -        ?


PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes


PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO



Prevalence = (Number with Shared data) / (Number with Created data)
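
As a worked instance of the prevalence ratio, the sketch below uses the illustrative rows from the slides (toy values, not study results). The dictionary layout and the `prevalence` helper are assumptions made for this example.

```python
# Toy rows copied from the slide tables; None marks an unknown value.
rows = [
    {"pubmed_id": 234, "portal": "PMC", "created": "Yes", "shared": "Yes"},
    {"pubmed_id": 345, "portal": "HighPr", "created": "Yes", "shared": "Yes"},
    {"pubmed_id": 456, "portal": "Scirus", "created": "Yes", "shared": "Yes"},
    {"pubmed_id": 567, "portal": "PMC", "created": "Yes", "shared": "No"},
    {"pubmed_id": 678, "portal": "PMC", "created": "Yes", "shared": "No"},
    {"pubmed_id": 789, "portal": "HighPr", "created": "No", "shared": None},
    {"pubmed_id": 890, "portal": "PMC", "created": "No", "shared": None},
    {"pubmed_id": 901, "portal": None, "created": None, "shared": None},
]


def prevalence(records: list) -> float:
    """Share of data-creating studies that also shared their data."""
    # The denominator is restricted to studies known to have created data.
    created = [r for r in records if r["created"] == "Yes"]
    if not created:
        raise ValueError("no studies with created data")
    shared = [r for r in created if r["shared"] == "Yes"]
    return len(shared) / len(created)


print(f"prevalence = {prevalence(rows):.2f}")
```

Studies that did not create data, or whose status is unknown, drop out of the denominator, mirroring the filtering step shown in the tables.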

Aim 3b: Correlates with data sharing

Covariates


Features to include:
• Does the journal have a data sharing policy?
• Is the study funded by the NIH?
• Is it subject to the NIH data sharing plan
requirement?
• Number of authors
• Journal impact factor
• Are the experimental samples from humans?
• Disease of study
• Year of publication
• …

Covariates

PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2


Univariate odds ratios
Multivariate logistic regression
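
A univariate odds ratio for a single binary covariate, such as NIH funding, can be computed directly from a 2x2 table of counts. The sketch below uses made-up counts and hypothetical helper names, and the interval is the standard log-scale (Woolf) approximation. The multivariate logistic regression itself would need a statistics package and is not shown.

```python
import math


def odds_ratio(shared_with: int, unshared_with: int,
               shared_without: int, unshared_without: int) -> float:
    """Odds of sharing with the covariate divided by odds without it."""
    return (shared_with * unshared_without) / (unshared_with * shared_without)


def odds_ratio_ci(shared_with: int, unshared_with: int,
                  shared_without: int, unshared_without: int,
                  z: float = 1.96) -> tuple:
    """Approximate 95% interval computed on the log odds ratio scale."""
    log_or = math.log(odds_ratio(shared_with, unshared_with,
                                 shared_without, unshared_without))
    se = math.sqrt(1 / shared_with + 1 / unshared_with
                   + 1 / shared_without + 1 / unshared_without)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts: NIH-funded studies (30 shared, 20 did not) versus
# studies without NIH funding (10 shared, 40 did not).
counts = (30, 20, 10, 40)
estimate = odds_ratio(*counts)
low, high = odds_ratio_ci(*counts)
print(f"odds ratio = {estimate:.2f} (approx. 95% CI {low:.2f} to {high:.2f})")
```

An odds ratio above 1 with an interval that excludes 1 would suggest the covariate is associated with sharing, before adjusting for the other covariates.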

[Diagram: covariates (Portal, Journal policy?, NIH funded?, # authors, ...) as predictors of Shared data?]

Aim 3c: Model of data sharing

Exploratory factor analysis

[Diagram: covariates (Portal, ...) grouped into candidate factors, such as Mandates (weak, strong) and Amount of Collaboration, related to Shared data?]

http://www.flickr.com/photos/donjuanna/322798429/

Limitations
• Association does not imply causation
• Important influences will be missed due to focus on
measurable variables

• Some derived variables involve many estimates and
assumptions

• Only considering public sharing in primary
centralized databases

• Only one datatype
• Only research studies made available in full-text
portals

Risks and contingency plans
NLP performance may be inadequate
  supplement with manual annotation via Mechanical Turk

Author ambiguity may introduce extreme outliers
  use Author-ity (Smalheiser and Torvik, 2005) for name
  disambiguation

Unable to derive a robust exploratory factor model
  try other clustering techniques

Several variables may be unexpectedly difficult to
extract and cross-reference
  if not essential, defer analysis of that variable

Current status
Aim 1: Does sharing have benefit for
those who share?

Aim 2: Can sharing and withholding be
systematically measured?

Aim 3: How often is data shared?
What predicts sharing?
How can we model sharing behavior?

Pilot completed.
Now: full dataset collection

Anticipated contributions
• Published assessment of the observed and
measured rewards, prevalence, and patterns of
gene expression microarray dataset sharing

• Publicly available dataset associating microarray
study publications with data sharing status

• Generalizable approach for developing practical,
real-world information retrieval using
centralized full-text query portals

• Preliminary model of data sharing behavior
based on this large dataset

Future work

• Identify and model data reuse
• Citation analysis of the large cohort
• Supplement with survey responses

http://www.flickr.com/photos/cogdog/123072/

Data sharing plan

I post my data, code, and statistical scripts at
http://www.dbmi.pitt.edu/piwowar
Share yours too!

http://www.flickr.com/photos/myklroventine/892446624/

Thanks to:
➡ the NLM for funding training grant 5 T15 LM007059-22
➡ the Dept of Biomedical Informatics at the U of Pittsburgh
➡ my committee

Dr Wendy Chapman, Biomed Informatics
Dr Ellen Detlefsen, iSchool
Dr Madhavi Ganapathiraju, Bioinformatics
Dr Brian Butler, Katz School of Business
Dr Gunther Eysenbach, U of Toronto, Health Policy Mgmt and Evaluation

Funder Journal Investigator Institution Study

Is research data shared
after publication?

aim

Prevalence of data withholding
via surveys
[Bar chart, axis 0% to 40%; bar values not recoverable]
• self-reported denying a request in last 3 years
• trainees self-reported denying a request
• been denied access to data, materials, code
• authors “not able to retrieve raw data”
• not willing to release data

Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics. 2001.

Self-reported reasons for data
withholding
[Bar chart, axis 0% to 80%; bar values not recoverable]
• sharing is too much effort
• want student or jr faculty to publish more
• they themselves want to publish more
• cost
• industrial sponsor
• confidentiality
• commercial value of results

Campbell et al. JAMA. 2002.

Correlates with self-reported data
withholding
[Bar chart, axis 0 to 3; bar values not recoverable]
• industry involvement
• perceived competitiveness of field
• male
• sharing discouraged in training
• human participants
• academic productivity

Blumenthal et al. Acad Med. 2006.
