JCDL doctoral consortium 2008: Proposed Foundations for Evaluating Data Sharing and Reuse in the Biomedical Literature

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Notes on slide 1

    Let’s start by putting on a few hats. You’ve just finished analyzing the results of your pioneering clinical trial. You suspect your results are generalizable. It would sure help the impact of your publication if you could include an analysis on an independent population.

    You have a new data mining technique you’d like to try.

    You just read the latest issue of your favorite journal, and one of the results contradicts the research you’ve been doing. Is your research off base? Perhaps the authors made a mistake? You’d like to take a closer look.

    You are a student, and you relish the rich learning experience of applying the new concepts to real data with all of it inherent complexities.

    You are a smart cookie with unique perspective and skills, and you’d like to help advance the research frontier. Unfortunately, you live in a rural area that doesn’t have access to hospitals or wet labs to generate your own data.

    To help achieve these benefits, the science community is spending lots of time and money on databases, initiatives, and policies that encourage people to share their data.

    But is it worth it? Are we realizing the benefits? Could we be doing better?

    We cannot manage what we do not measure. If we believe that data sharing and reuse has value, or even if we think it might, we need to study our current practices and behavior to choose the most effective path for the future.

    To answer that, we need to evaluate.

    Not sure what your data is good for? A Nature editorial this summer. I’ll let you read it. Got data? Nature Neuroscience 10 , 931 (2007)

    I’d like to thank to do thanks times four. To the NLM and JCDL for their generous funding support, my advisors for their insightful feedback, the reviewers for their helpful commends, and last but not least, thanks to each and every one of you who have made your detailed research data available in the past, or will do so the future. It matters.

    Favorites, Groups & Events

    JCDL doctoral consortium 2008: Proposed Foundations for Evaluating Data Sharing and Reuse in the Biomedical Literature - Presentation Transcript

    1. Prevalence and Patterns of Biomedical Research Data Sharing and Reuse Heather Piwowar Department of Biomedical Informatics University of Pittsburgh JCDL Doctoral Consortium June 2008 except clipart
    2. $$
    3. ?
    4. Is it working? Is it worth it? Are scientists sharing their data? Are other scientists reusing the data? Who, if anyone, is benefiting from the policies, tools, and initiatives?
    5. We cannot manage what we do not measure
    6. Dissertation Objective: Evaluate the patterns and prevalence of biomedical research data sharing and reuse
    7. Prior work in this area •  Surveys •  Manual audits •  Automated classification of citation contexts See http://www.citeulike.org/user/hpiwowar for bibliography
    8. Missing: a study of data sharing and reuse behavior and impact based on a broad spectrum of instances
    9. Dissertation Research Questions 1.  prevalence of sharing and reuse 2.  patterns of sharing and reuse 3.  affect of sharing and reuse on impact 4.  implications of findings
    10. Data type •  Gene expression microarrays http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png
    11. Sharing type •  Openly online, mentioned in publication –  PubMed –  filtered with MeSH terms for gene-expression –  English, machine-readable full text –  2000-2007
    12. Dissertation dataset Article ID Link to full text 234 http://… 456 http://… 657 http://… 897 http://…
    13. 1. What is the prevalence of biomedical research data sharing? of biomedical research data reuse?
    14. Endpoints •  Three endpoints: –  Does this study produce raw data? –  If so, does this study share the raw data? –  Does this study reuse others’ raw data?
    15. Example text cues •  Sharing –  “our data has been deposited in the GEO database” –  “the microarray expression values from this study are available at the following website” •  Reuse –  “using the data of Smith et al, we… ” –  “we downloaded four datasets from…”
    16. Identification of endpoints •  Train and evaluate a Natural Language Processing system to recognize endpoint cues within full text •  Performance to be summarized as precision and recall with confidence intervals Pilot data for NLP identification of sharing: Piwowar and Chapman, submitted to AMIA 2008.
    17. Research Question 1: Prevalence | Endpoints | Article ID Link to full Produces Shares data? Reuses text data? data? 234 http://… TRUE TRUE FALSE 456 http://… TRUE TRUE TRUE 657 http://… TRUE FALSE FALSE 897 http://… FALSE n/a TRUE Calculate Percentages
    18. 2. What features are most associated with an investigator’s decision to share or reuse a biomedical research dataset?
    19. •  Features to include: –  Journal Impact Factor –  Number of Authors –  Are the samples from humans –  Subdiscipline –  Strictness of Journal policy on data sharing –  Institution –  Year of publication –  … Pilot data for journal policies: Piwowar and Chapman, ELPUB 2008.
    20. | Endpoints | | Covariates | Article Link to Produces Shares Reuses Journal # Human ID full data? data? data? Impact Authors Data text Factor 234 http://… TRUE TRUE FALSE 1.5 2 TRUE 456 http://… TRUE TRUE TRUE 23.5 1 FALSE 657 http://… TRUE FALSE FALSE 2.4 6 FALSE 897 http://… FALSE n/a TRUE 0.6 2 TRUE Journal Human Impact Factor # Authors Data … Shares data? Reuses data? Multivariate logistic regressions
    21. 3. Does sharing or reusing data contribute to the impact of a research article, independently of other factors?
    22. Assumption: citation count is a proxy for research impact
    23. | Endpoints | | Covariates | Article … Produces Shares Reuses … … … Number of ID data? data? data? Citations 234 TRUE TRUE FALSE 1.5 2 TRUE 0 456 TRUE TRUE TRUE 23.5 1 FALSE 4 657 TRUE FALSE FALSE 2.4 6 FALSE 4 897 FALSE n/a TRUE 0.6 2 TRUE 5 Journal Human Impact Factor # Authors Data … Shares data? Reuses data? Number of Citations Multivariate linear regressions Pilot data on citation impact for sharing: Piwowar, Day and Fridsma, PLoS ONE 2007.
    24. 4. What do the results suggest for developing efficient, effective policies, tools, and initiatives for promoting data sharing and reuse?
    25. We might discover, for example: •  Lots of sharing for non-human data •  All reuse within the first 5 years •  Journal and funder requests are ineffective
    26. Significance •  Dataset –  social network analysis, simulation, … •  NLP Classifiers •  Best-practice patterns, communities •  Novel research connections •  Inspire further work in this area –  policy evaluation, reusability metrics –  citations for data (Data Reuse Registry) Piwowar, Chapman. Envisioning a Data Reuse Registry. Poster submitted to AMIA 2008.
    27. Limitations •  Causation? •  Other data types? •  Other sharing mechanisms?
    28. “Does anyone want your data? That’s hard to predict[…] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.” Got data? Nature Neuroscience 10, 931 (2007)
    29. My data is here www.dbmi.pitt.edu/piwowar I urge you to share yours, too.
    30. Thank you Funding: NLM informatics training grant Advisor: Dr. Wendy Chapman Committee: Dr. Ellen Detlefsen Dr. Madhavi Ganapathiraju + JCDL Funding and Reviewers! Questions, Comments, or Suggestions? except clipart

    + hpiwowarhpiwowar, 2 months ago

    custom

    186 views, 0 favs, 0 embeds more stats

    Science progresses by building upon previous resear more

    More info about this document

    CC Attribution License

    Go to text version

    • Total Views 186
      • 186 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 4
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories