Thesis Proposal, as presented for dissertation proposal defense
The slides I presented for my PhD proposal defense for my project, "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." Dept of Biomedical Informatics, University of Pittsburgh.

  1. Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data. Heather Piwowar, Department of Biomedical Informatics, University of Pittsburgh
  2. Sharing research data http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/ Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; htp://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  4. Sharing research data PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/ Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; htp://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  8. Shared data benefits science: Verify, Understand, Extend, Explore, Combine, Synergize, Train, Reduce
  9. But... costly for authors: Find, Organize, Document, Deidentify, Format, Decide, Ask, Submit, Answer questions, Worry about mistakes being found, Worry about data being misinterpreted, Worry about being scooped, Forgo money and IP and prestige???
  10. As a result, policy makers have spent lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  11. ... on initiatives, requests, requirements, and tools: NIH data sharing plan requirement; Journal requirements; Databases; Data sharing grids like BIRN and caBIG; Standards; Editorials, letters to the editor, discussion....
  12. http://www.flickr.com/photos/mesh/14102209/
  13. lots of data sharing! http://www.genome.jp/en/db_growth.html
  14. but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  15. you can not manage what you do not measure http://www.flickr.com/photos/archeon/2941655917/
  16. Long-term motivation: I believe that analysis of the impact, prevalence, and patterns with which investigators share and withhold gene expression microarray research data can uncover rewards, best practices, and opportunities for increased adoption of data sharing.
  17. Aim 1: Does sharing have benefit for those who share? Aim 2: Can sharing and withholding be systematically measured? Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
  18. Related research: Data usually collected via surveys and/or manual audits http://www.flickr.com/photos/jima/606588905/
  19. Prevalence of data sharing via manual audit. Chart categories (axis 0% to 100%): DNA sequences; gene expression microarrays; proteomics spectra. Sources: Noor et al. PLoS Biology 2006. Ochsner et al. Nature Methods 2008. Piwowar et al. PLoS ONE 2007. Editorial. Nature Biotech 2007.
  20. Prevalence of data withholding via surveys. Chart categories (axis 0% to 40%): self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data. Sources: Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
  21. Self‐reported reasons for data withholding. Chart categories (axis 0% to 80%): sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. Source: Campbell et al. JAMA 2002.
  22. Correlates with self‐reported data withholding. Chart categories (axis 0 to 3): industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. Source: Blumenthal et al. Acad Med. 2006
  23. Models of data and knowledge sharing
  24. Andriessen. Conditions for the willingness to share knowledge, 2006.
  25. Harder. SMG WP 6/2008.
  26. Cabrera and Cabrera. Int J of HR Mgmt. 2005.
  27. Kuo. JASIST. 2008.
  28. Limitations of the related research • manual audits: small sample sizes • surveys: few variables + self-reporting bias • not much focus on measuring demonstrated behavior • not much focus on impact or policy • not much focus on biomedical data other than DNA sequences
  29. Needed: a study of data sharing behavior and impact that includes • a measurement of demonstrated behavior • policy variables • estimate of rewards • a broad and deep selection of data creation instances • a focus on biomedical data other than DNA sequences
  30. Aim 1: Does sharing have benefit for those who share? Aim 2: Can sharing and withholding be systematically measured? Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
  31. Scope of current study • type of data: gene expression microarrays • sharing mechanism: centralized databases • studies: English full text available in a centralized portal • covariates: extracted from Medline and database sources http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png
  32. Preliminary research
  33. http://farm3.static.flickr.com/2146/2389590651_9bbcc9d07e.jpg
  34. Aim 1
  35. Aim 1: Does sharing have benefit for those who share? http://www.flickr.com/photos/sunrise/35819369/
  36. Aim 1: Does sharing have benefit for those who share?
  38. Aim 1: Does sharing have benefit for those who share? Note the logarithmic scale
  39. Aim 1: Does sharing have benefit for those who share?
  40. Aim 1: Associated citation increase http://www.flickr.com/photos/sunrise/35819369/
  41. Next: What factors predict sharing? http://www.flickr.com/photos/ryanr/142455033/
  42. Can I use the same methods of Aim 1 to choose studies and determine data sharing status?
  43. Can I use the same methods of Aim 1 to choose studies and determine data sharing status? No, those methods don’t scale to identify or classify enough datapoints
  44. Aim 2
  45. Need automated methods to: Identify studies that generate datasets that could potentially be shared (Aim 2a); Determine which of these have in fact been shared (Aim 2b)
  46. Aim 2a: Identify studies that create gene expression microarray data http://www.flickr.com/photos/lofaesofa/248546821/
  47. Aim 2a: Identify studies that create gene expression microarray data. Easy, via MeSH indexing terms? gene expression profiling and/or microarray analysis. Unfortunately, these have neither high recall nor precision.
  48. Look for wetlab methods in full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
  49. BUT this requires developing and maintaining a full‐text archive!
  50. What about using PubMed Central?
  51. Can reach ~85% of articles with full‐text links via U of Pittsburgh library subscriptions, when combined with two other full‐text query portals:
  52. Aim 2a: Identify studies that create gene expression microarray data. Derive a full‐text query with sufficiently high recall (> 1250 studies) and precision (> 70%).
  53. Aim 2a: Identify studies that create gene expression microarray data. Reference standard? Ochsner et al. • 2007 • 20 journals • broad query for microarray studies • identified 400 studies that created gene expression microarray data
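Scoring a candidate query against a reference standard, as slides 52 and 53 describe, might look roughly like the sketch below. It is a minimal illustration, not the proposal's evaluation code: the function name and the PubMed IDs are hypothetical. Recall is computed here as the conventional proportion of reference-standard studies retrieved, whereas the slide states its recall target as a count of studies.

```python
def precision_recall(retrieved_ids, reference_positive_ids):
    """Return (precision, recall) of a query's hits against a reference standard."""
    retrieved = set(retrieved_ids)
    reference = set(reference_positive_ids)
    true_positives = retrieved & reference
    # Precision: fraction of retrieved studies that truly created the data.
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of reference-standard studies that the query found.
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical PubMed IDs, for illustration only.
query_hits = {101, 102, 103, 104}
reference_standard = {102, 103, 104, 105, 106}

precision, recall = precision_recall(query_hits, reference_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up IDs, three of the four hits are in the reference standard, so this toy query would clear a 70% precision threshold.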
  54. Aim 2a: Identify studies that create gene expression microarray data. Development corpus? PubMed Central Open Access subset + TREC Genomics IR subset = about 5000 relevant articles with about 50% true positive rate
  55. Aim 2a: Identify studies that create gene expression microarray data. Development approach? • Pattern building via manual inspection • Classification decision trees with n‐grams • Borrow approaches from: Autoslog‐TS; automated regular expression building; semi‐supervised learning; retrieval query aspects
  56. Aim 2b
  57. Aim 2b: Identify studies that share their expression microarray data http://www.flickr.com/photos/dcassaa/422261773/
  58. Aim 2b: Identify studies that share their expression microarray data
  61. Aim 2b: Identify studies that share their expression microarray data. pmc_gds[filter]
  62. Aim 2b: Identify studies that share their expression microarray data. Unfortunately, the submission citation is often left blank when data is submitted prior to publication.
  63. Aim 2b: Identify studies that share their expression microarray data. To achieve 70% recall, I may have to supplement with a query of the full text, such as: (geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
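To make the logic of the boolean query on slide 63 concrete, here is a minimal sketch that applies the same terms to a single article's text. It assumes exact, case-insensitive word matching; a real full-text portal may stem, tokenize, or rank differently. The function name, the example sentences, and the accession `GSE0000` are hypothetical.

```python
import re


def matches_sharing_query(full_text):
    """Apply the slide's boolean query to one article's full text."""
    text = full_text.lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    # "gene expression" is a quoted phrase in the query, so match it as a phrase.
    has_phrase = re.search(r"\bgene\s+expression\b", text) is not None
    include = (
        ("geo" in words or "omnibus" in words)
        and "microarray" in words
        and has_phrase
        and "accession" in words
    )
    # The NOT clause targets articles that reuse or describe data rather than deposit it.
    exclude = (
        "databases" in words
        or "user" in words
        or "users" in words
        or ("public" in words and "accessed" in words)
        or ("downloaded" in words and "published" in words)
    )
    return include and not exclude


# Hypothetical sentences, for illustration only.
deposit_sentence = "Microarray gene expression data were deposited in GEO under accession GSE0000."
reuse_sentence = "We downloaded published microarray gene expression data from GEO, accession GSE0000."

print(matches_sharing_query(deposit_sentence))
print(matches_sharing_query(reuse_sentence))
```

The second sentence is rejected because it contains both "downloaded" and "published", which is the kind of data-reuse language the exclusion clause appears designed to filter out.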
  64. Aim 2b: Identify studies that share their expression microarray data. Reference standard:
  65. Aim 3
  66. Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior? http://www.flickr.com/photos/ryanr/142455033/
  67. Aim 3a: Prevalence of data sharing
  68. Aim 3a: Prevalence of data sharing

      PubMed ID   Portal   Created data?
      234         PMC      Yes
      345         HighPr   Yes
      456         Scirus   Yes
      567         PMC      Yes
      678         PMC      Yes
      789         HighPr   No
      890         PMC      No
      901         ‐        ?

  70. Aim 3a: Prevalence of data sharing

      PubMed ID   Portal   Created data?
      234         PMC      Yes
      345         HighPr   Yes
      456         Scirus   Yes
      567         PMC      Yes
      678         PMC      Yes

  71. Aim 3a: Prevalence of data sharing

      PubMed ID   Portal   Created data?   Shared data?
      234         PMC      Yes             Yes
      345         HighPr   Yes             Yes
      456         Scirus   Yes             Yes
      567         PMC      Yes             NO
      678         PMC      Yes             NO

  72. Aim 3a: Prevalence of data sharing

      PubMed ID   Portal   Created data?   Shared data?
      234         PMC      Yes             Yes
      345         HighPr   Yes             Yes
      456         Scirus   Yes             Yes
      567         PMC      Yes             NO
      678         PMC      Yes             NO

      Prevalence = (Number with Shared data) / (Number with Created data)

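The prevalence ratio on slide 72 can be sketched in a few lines. The rows are copied from the illustrative table on the slide; the field names and the `prevalence` function are assumptions made for this sketch, not the proposal's code.

```python
# Illustrative rows from the slide's table (not real study results).
studies = [
    {"pubmed_id": 234, "portal": "PMC", "created": True, "shared": True},
    {"pubmed_id": 345, "portal": "HighPr", "created": True, "shared": True},
    {"pubmed_id": 456, "portal": "Scirus", "created": True, "shared": True},
    {"pubmed_id": 567, "portal": "PMC", "created": True, "shared": False},
    {"pubmed_id": 678, "portal": "PMC", "created": True, "shared": False},
]


def prevalence(rows):
    """Number with shared data divided by number with created data."""
    created = [row for row in rows if row["created"]]
    if not created:
        raise ValueError("no studies that created data")
    shared = [row for row in created if row["shared"]]
    return len(shared) / len(created)


print(f"{prevalence(studies):.0%}")
```

Restricting the denominator to studies that created data mirrors the step between slides 68 and 70, where the rows without created data drop out of the table.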
  73. Aim 3b: Correlates with data sharing
  74. Aim 3b: Correlates with data sharing

      PubMed ID   Portal   Created data?   Shared data?   Covariates
      234         PMC      Yes             Yes
      345         HighPr   Yes             Yes
      456         Scirus   Yes             Yes
      567         PMC      Yes             NO
      678         PMC      Yes             NO

  75. Aim 3b: Correlates with data sharing. Features to include: • Does the journal have a data sharing policy? • Is the study funded by the NIH? • Number of authors • Research-orientation of the primary institution • Journal impact factor • Are the samples from humans? • Disease of study • Year of publication • …
  76. Aim 3b: Correlates with data sharing

                                                          Covariates
      PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
      234         PMC      Yes             Yes            strong           yes          2
      345         HighPr   Yes             Yes            weak             yes          5
      456         Scirus   Yes             Yes            weak             no           6
      567         PMC      Yes             NO             strong           yes          5
      678         PMC      Yes             NO             strong           no           2

  77. Aim 3b: Correlates with data sharing

                                                          Covariates
      PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
      234         PMC      Yes             Yes            strong           yes          2
      345         HighPr   Yes             Yes            weak             yes          5
      456         Scirus   Yes             Yes            weak             no           6
      567         PMC      Yes             NO             strong           yes          5
      678         PMC      Yes             NO             strong           no           2

      Diagram nodes: Journal policy?, NIH funded?, # authors, ..., Shared data?

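As a first descriptive look at correlates, the sharing rate could be tabulated for each level of a covariate. The slides do not say which statistical method the analysis will use, so this is only a tabulation sketch. The rows come from the illustrative table on slide 76, and the field and function names are hypothetical.

```python
from collections import defaultdict

# Illustrative rows from the slide's table (not real study results).
rows = [
    {"pubmed_id": 234, "shared": True, "journal_policy": "strong", "nih_funds": "yes", "n_authors": 2},
    {"pubmed_id": 345, "shared": True, "journal_policy": "weak", "nih_funds": "yes", "n_authors": 5},
    {"pubmed_id": 456, "shared": True, "journal_policy": "weak", "nih_funds": "no", "n_authors": 6},
    {"pubmed_id": 567, "shared": False, "journal_policy": "strong", "nih_funds": "yes", "n_authors": 5},
    {"pubmed_id": 678, "shared": False, "journal_policy": "strong", "nih_funds": "no", "n_authors": 2},
]


def sharing_rate_by(records, covariate):
    """Return {covariate level: fraction of studies at that level that shared data}."""
    counts = defaultdict(lambda: [0, 0])  # level -> [shared count, total count]
    for record in records:
        level = record[covariate]
        counts[level][0] += int(record["shared"])
        counts[level][1] += 1
    return {level: shared / total for level, (shared, total) in counts.items()}


print(sharing_rate_by(rows, "journal_policy"))
print(sharing_rate_by(rows, "nih_funds"))
```

Five made-up rows support no conclusion; the sketch only shows the shape of the per-covariate comparison that a larger cohort would feed.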
  78. Aim 3c: Model of data sharing
  79. Aim 3c: Model of data sharing

                                                          Covariates
      PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
      234         PMC      Yes             Yes            strong           yes          2
      345         HighPr   Yes             Yes            weak             yes          5
      456         Scirus   Yes             Yes            weak             no           6
      567         PMC      Yes             NO             strong           yes          5
      678         PMC      Yes             NO             strong           no           2

  80. Aim 3c: Model of data sharing

                                                          Covariates
      PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
      234         PMC      Yes             Yes            strong           yes          2
      345         HighPr   Yes             Yes            weak             yes          5
      456         Scirus   Yes             Yes            weak             no           6
      567         PMC      Yes             NO             strong           yes          5
      678         PMC      Yes             NO             strong           no           2

      Diagram nodes: Mandates, Amount of Collaboration, ..., Shared data?

  81. Aim 3c: Model of data sharing

                                                          Covariates
      PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
      234         PMC      Yes             Yes            strong           yes          2
      345         HighPr   Yes             Yes            weak             yes          5
      456         Scirus   Yes             Yes            weak             no           6
      567         PMC      Yes             NO             strong           yes          5
      678         PMC      Yes             NO             strong           no           2

      Diagram nodes: Mandates, Amount of Collaboration, ..., Shared data? Diagram labels: Weak, Strong

  82. http://www.flickr.com/photos/rachynymph/2930626195/
  83. Assumptions: That the following limitations are randomly distributed: • Ambiguous author names • The method of describing data generation • Studies with data in GEO but no submission links • Studies that don’t mention sharing in the full-text article. The first and last authors are usually primary decision-makers about whether to share data. Citations are a valued, though imperfect, measure of research impact.
  84. Limitations: Association does not imply causation. Only one datatype: microarray data. Only considering sharing in the primary centralized databases. Many variables are USA-centric. Results will only be generalizable to research studies made available in full-text portals.
  85. Risks and contingency plans • NLP performance may be inadequate: supplement with manual annotation via Mechanical Turk • Author ambiguity may introduce extreme outliers: use Author-ity software on extreme outliers • Unable to derive a robust exploratory factor model: try other clustering techniques • Several variables may be unexpectedly difficult to extract: if not essential, defer the analysis of that variable to future work
  86. Contributions • an assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing • a publicly available dataset associating microarray study publications with data sharing status • a generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals • preliminary models of data sharing behavior
  87. Publication plan http://www.flickr.com/photos/linkwize/926334421/
  88. Publication plan: Aim 1. Do studies with publicly shared datasets receive more citations? Published in PLoS ONE in February 2007
  89. Publication plan: Aim 2a. How can we identify studies that generate certain data, given full-text query access through centralized portals? Targeted journal: Journal of Medical Internet Research? BMC Bioinformatics? other?
  90. Publication plan: Aim 2b, 3a, 3b. What factors are associated with demonstrated data sharing behavior? Targeted journal: BMC Bioinformatics? BMC Biology? PLoS Biology? a research policy journal? other?
  91. Publication plan: Aim 3c. Derive (and validate?) a preliminary model of demonstrated research data sharing behavior. Targeted journal: JASIST? (Journal of the American Society for Information Science and Technology) Information Research? Journal of Documentation? Science Communication? Data Science Journal? other?
  92. Future work: 1. Identify and model data reuse 2. Citation analysis of the large cohort 3. Supplement with survey responses 4. Generalize the method for creating queries for full-text portals http://www.flickr.com/photos/cogdog/123072/
  93. Data sharing plan: I plan to share my code, data, and process openly during the research via blogs and repositories. http://www.flickr.com/photos/myklroventine/892446624/
  94. Thanks to the Dept of Biomedical Informatics at the U of Pittsburgh, the NLM for funding through training grant 5 T15 LM007059-22, those with photos on Flickr under a Creative Commons license, Wendy for her support and feedback, and my committee for anticipated feedback.... Questions and Suggestions?
  95. Future work
  96. Audience • Funders, policy makers and thought leaders. • Database, software, and data standard developers. • Biomedical informatics community. • Information science and digital library community. • Open Science community. • Primary Investigators.
  97. Recent related grants. NIH: Haga, S. Exploring Attitudes About Data Disclosure and Data-Sharing in Genomics Research. NSF: Hedstrom, M. Incentives for Data Producers to Create Archive-Ready Data Sets. National Inst of Nursing Research: Pienta, A. Barriers and Opportunities for Sharing Research Data. +others