NESCent visit: Measuring progress toward a cultural norm of shared (and reused!) biomedical research data
Preliminary work and future directions in measuring biomedical research data sharing

Presentation Transcript

  • Measuring progress toward a cultural norm of shared (and reused!) biomedical research data. Heather Piwowar, Department of Biomedical Informatics, University of Pittsburgh
  • Sharing research data http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • Sharing research data PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • Shared data benefits science: Verify, Understand, Extend, Explore, Combine, Synergize, Train, Reduce
  • But... costly for authors: Find, Organize, Document, Deidentify, Format, Decide, Ask, Submit, Answer questions, Worry about mistakes being found, Worry about data being misinterpreted, Worry about being scooped, Forgo money and IP and prestige???
  • As a result, policy makers have spent lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  • ... on initiatives, requests, requirements, and tools: NIH data sharing plan requirement; journal requirements; public databases; data sharing grids like BIRN and caBIG; data formatting standards; editorials, letters to the editor, discussion....
  • http://www.flickr.com/photos/mesh/14102209/
  • lots of data sharing! http://www.genome.jp/en/db_growth.html
  • but how much isn’t shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do about it?
  • you cannot manage what you do not measure http://www.flickr.com/photos/archeon/2941655917/
  • research questions 1. Is there benefit for those who share? 2. Do journal policies increase rates of sharing? 3. What other factors are correlated with sharing and withholding data?
  • microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
  • microarray data
  • 1. Is there benefit for those who share? http://www.flickr.com/photos/sunrise/35819369/
  • currency of value? Citations. $50! Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
  • Prior work focused on the citation advantage of an open access publishing model. Our question: are articles that share their raw research data cited more than articles that don’t?
  • dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). citations: ISI Web of Science Citation Index, citations from 2004-2005. data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. statistics: multivariate linear regression
  • Note: log scale
  • In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%) Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
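A percentage like the one above is what you get by back-transforming a coefficient from a regression on log-transformed citation counts. The sketch below shows only that arithmetic. It assumes a natural-log outcome (the slides note a log scale but not the base), and the function name and example coefficient are illustrative, not taken from the paper.

```python
import math

def percent_change_from_log_coef(beta: float) -> float:
    """Convert a coefficient on a natural-log outcome into a percent change.

    If log(citations) rises by beta when data is shared, citations are
    multiplied by exp(beta), which is a change of (exp(beta) - 1) * 100 percent.
    """
    return (math.exp(beta) - 1.0) * 100.0

# Illustrative coefficient only: the value that would correspond to a
# multiplicative effect of 1.69 on citation counts.
beta_shared = math.log(1.69)
pct = percent_change_from_log_coef(beta_shared)
```

The same transformation applies to each end of a confidence interval, which is why an interval that is symmetric on the log scale becomes asymmetric in percent terms.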
  • future work • collect a larger dataset for citation analysis (stay tuned) • investigate other datatypes • examine citation context
  • 2. Do journal data sharing policies increase sharing? http://www.flickr.com/photos/ryanr/142455033/
  • “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
  • Prior work examined data sharing policies in biomedicine, but these reviews are now dated, consider a variety of resources, and don’t correlate policy to behaviour. McCain. Science Communication, Vol. 16, No. 4. (1 June 1995), pp. 403-431 NAS. Sharing Publication-Related Data and Materials. (2003), p. 33
  • Our aim: look at data sharing policies within Instruction to Author statements of 70 journals, as they apply to gene expression microarray data.
  • content of data sharing policies Very diverse policies in terms of: • statements of policy motivation • datatype-specific policies • requested vs. required • data location • data format • data completeness • timeliness of sharing • consequences for not sharing • exceptions
  • strength of data sharing policies: No applicable policy (43%). Weak policy (24%): should, recommend, request; or must, but without database accession number. Strong policy (33%): must, required, condition of publication; requires database accession number
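The three strength categories above can be roughly approximated with keyword rules. This is a sketch, not the coding procedure used in the study. It assumes the input is already the data sharing passage from a journal's author instructions, and the function name, term lists and stem matching are illustrative choices.

```python
import re

# Stems taken from the slide's wording; "require" also matches "required"/"requires".
STRONG_TERMS = ("must", "require", "condition of publication")
WEAK_TERMS = ("should", "recommend", "request")

def policy_strength(policy_text: str) -> str:
    """Classify a data sharing policy passage as 'strong', 'weak' or 'none'."""
    text = policy_text.lower()
    if not text.strip():
        return "none"  # no applicable policy text at all

    def mentions(terms):
        # Match each term at a word start so that inflected forms also count.
        return any(re.search(r"\b" + re.escape(t), text) for t in terms)

    requires_accession = "accession number" in text
    if mentions(STRONG_TERMS) and requires_accession:
        return "strong"
    if mentions(STRONG_TERMS) or mentions(WEAK_TERMS):
        # Includes "must, but without database accession number".
        return "weak"
    return "none"
```

Keyword rules like these are brittle, so anything they label would still need a manual check against the full policy text.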
  • strength of data sharing policies: multivariate associations [Figure labels: Impact Factor; Open Access?; Society Publisher?; subject categories Biochemistry & Molecular Biology, Oncology; outcome: Journal has a data sharing policy?]
  • strength of data sharing policies associated with impact factor High-impact journals tend to have a strong data-sharing policy
  • data sharing policies associated with amount of sharing For each of the 70 journals, we measured the percent of articles that were cited from within GEO and ArrayExpress. We considered this a proxy for percent of articles with shared data.
  • data sharing policies associated with amount of sharing [Figure labels: Having a data-sharing policy?; Impact Factor; Open Access?; Society Publisher?; subject categories Genetics & Heredity, Multidisciplinary Sciences; outcome: % of articles with shared data]
  • our corpus of “gene expression microarray” articles may have included some that reused data and did not themselves produce primary data • these results should be considered preliminary, pending a more precise filter (stay tuned) http://www.flickr.com/photos/vlastula/300102949/
  • future work on journal policies • use a more precise filter to isolate data producing articles and thereby understand the absolute levels of data sharing • investigate other datatypes • look at associations with reviewer instructions and opinions
  • future work on funder policies • are they effective? (stay tuned) • what do people propose in data sharing plans? Do they do what they propose? Why not? • quantify the perceived worth of data sharing plans and accomplishments in funding and promotion decisions
  • 3. What other factors are correlated with sharing and withholding data? http://www.flickr.com/photos/cogdog/123072/
  • Prior work has focused on surveys and studies of intention. Our aim: measure associations between observed data sharing behaviour and environmental variables Blumenthal et al. Acad Med. 2006 Campbell et al. JAMA. 2002 Kyzas et al. J Natl Cancer Inst. 2005 Vogeli et al. Acad Med. 2006 Reidpath et al. Bioethics 2001
  • pilot dataset: Ochsner et al. manually reviewed 20 journals for 2007: 400 studies, 200 shared their microarray data. Ochsner et al. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991.
  • pilot variables: Journal impact factor; Funder mandates; Journal mandates; Investigator “experience”. Outcome: Is research data shared after publication?
  • funder mandates NIH 2003 Data Sharing Requirement Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
  • funder mandates: Assumed the data sharing requirement was applicable if the NIH grant numbers associated with the PubMed entry had $750 000 in total funding in any year since 2004, plus an NIH grant number with a leading “1” or “2” since 2004
  • author experience Publication history and impact proxy First and last authors: • years since first paper • h-index (the largest number N such that an author has N papers cited at least N times) • a-index
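The h-index definition above translates directly into code. The sketch below also includes an a-index, taken here to mean the average citation count of the papers in the h-core. The slides do not define the a-index, so that definition is an assumption. Function names are illustrative.

```python
def h_index(citations):
    """Largest N such that N papers have at least N citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break  # counts only fall from here, so no larger N can qualify
    return h

def a_index(citations):
    """Average citations of the h most-cited papers (assumed definition)."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    core = sorted(citations, reverse=True)[:h]
    return sum(core) / h
```

For an author whose papers have 10, 8, 5, 4 and 3 citations, four papers have at least four citations each, so the h-index is 4 and the a-index under this definition is the mean of 10, 8, 5 and 4.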
  • author experience Derived h-index (pubmedi citation indices): Author publication history: Author name disambiguation: Author-ity web service. Torvik & Smalheiser (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. Citation counts:
  • pilot variables: Journal impact factor; Funder mandates; Journal mandates; Investigator “experience”. Outcome: Is research data shared after publication?
  • stats Univariate odds ratios Multivariate logistic regression
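A univariate odds ratio compares the odds of sharing between two groups in a 2x2 table. The sketch below computes one with a Wald 95% confidence interval. The function name is illustrative and the counts in the usage line are made up for illustration; they are not results from the pilot.

```python
import math

def odds_ratio(group_a_shared, group_a_not, group_b_shared, group_b_not):
    """Odds ratio (group A vs. group B) with a Wald 95% confidence interval.

    All four cells must be non-zero; this sketch applies no continuity correction.
    """
    ratio = (group_a_shared * group_b_not) / (group_a_not * group_b_shared)
    # Standard error of log(odds ratio) for a 2x2 table.
    se = math.sqrt(
        1 / group_a_shared + 1 / group_a_not + 1 / group_b_shared + 1 / group_b_not
    )
    lower = math.exp(math.log(ratio) - 1.96 * se)
    upper = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, lower, upper

# Hypothetical counts, NOT pilot data: 60 of 100 studies shared in group A,
# 40 of 100 shared in group B.
ratio, lower, upper = odds_ratio(60, 40, 40, 60)
```

An interval that excludes 1 corresponds to a statistically significant univariate association at the 5% level. The multivariate logistic regression then checks whether the association holds after adjusting for the other variables.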
  • results of pilot [Figure legend: Not statistically significant / Statistically significant. Variables: Journal impact factor; Funder mandates; Journal mandates; Investigator “experience”. Outcome: Is research data shared after publication?]
  • results of pilot 33%
  • PhD dissertation More samples, more variables http://www.flickr.com/photos/krcla/2069243613/
  • More samples: Developed and evaluated automated methods to: • Identify studies that generate datasets that could potentially be shared • Determine which of these have in fact been shared
  • To identify studies that generate datasets, use a query on the full text of published articles: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
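The Boolean structure of the full-text query above can be mirrored locally on article text that has already been downloaded. The sketch below assumes plain-text input and simple case-insensitive word matching. Real full-text search engines tokenize and stem differently, so results would not match theirs exactly. Function names are illustrative.

```python
import re

def _has(text: str, term: str) -> bool:
    """True if the term occurs in text; a trailing '*' matches any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def generates_microarray_data(full_text: str) -> bool:
    """Apply the slide's query: (all topic terms) AND (any lab term) NOT (any excluded term)."""
    topic = all(_has(full_text, t) for t in ("gene expression", "microarray", "cell", "rna"))
    lab_method = any(_has(full_text, t) for t in ("rneasy", "trizol", "real-time pcr"))
    excluded = any(_has(full_text, t) for t in ("tissue microarray*", "cpg island*"))
    return topic and lab_method and not excluded
```

The lab-method terms are what separate data-producing studies from articles that only discuss microarrays, and the exclusions drop tissue microarray and CpG island studies.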
  • To determine which articles have shared data, use a query on the full text of published articles: pubmed_gds[filter] and query ArrayExpress
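One way to apply the pubmed_gds[filter] step to a batch of candidate articles is through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and sends nothing. The function name, the PMID-batching approach and the example PMIDs are assumptions for illustration, and the ArrayExpress query mentioned on the slide is not covered.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def gds_linked_search_url(pmids, retmax=100):
    """Build an ESearch URL that keeps only the given PMIDs that link to GEO DataSets."""
    id_clause = " OR ".join(f"{pmid}[pmid]" for pmid in pmids)
    # pubmed_gds[filter] restricts results to PubMed records with GEO DataSets links.
    term = f"({id_clause}) AND pubmed_gds[filter]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

# Placeholder PMIDs for illustration only.
url = gds_linked_search_url(["12345678", "23456789"])
```

Candidate PMIDs that are missing from the returned ID list would be treated as not having shared data in GEO.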
  • More variables: Use PubMed and a variety of other internet resources...
  • Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH? Journal: impact factor; strength of policy; open access?; number of microarray studies published. Investigator: years since first paper; h-index; a-index; previously shared?; previously reused?; gender. Institution: sector; size; impact rank; country. Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
  • stats Univariate odds ratios Multivariate logistic regression Exploratory factor analysis
  • results? http://www.flickr.com/photos/skrb/2427171774/
  • research questions 1. Is there benefit for those who share? 2. Do journal policies increase rates of sharing? 3. What other factors are correlated with sharing and withholding data?
  • what’s next?
  • future work previously mentioned... • citation analysis of larger cohort • journal policies with refined filter • beyond microarray data • deeper into journal and funder policies • and, finally....
  • Reuse. http://www.flickr.com/photos/boitabulle/3668162701/
  • who reuses data? why? when? who doesn’t? which datasets are most likely to be reused? how many datasets could be reused but aren’t? why aren’t they? what can we do about it?
  • One possible reuse research agenda 1. Inventory reuse acknowledgement patterns 2. Build full-text and metadata filters to identify instances of data reuse 3. Analyze patterns in data reuse choices 4. Survey data producers and data consumers to augment with intentions and perspectives
  • Resources • GEO list of reuse articles (currently 618) • Previous work in citation context classification • Amazon Mechanical Turk for annotation • Experimental Philosophy for insight into cultural norms • ... Teufel et al. (2006) Automatic classification of citation function. EMNLP.
  • Stakeholders • readers • reusers • authors • editors • reviewers • funders • database designers, maintainers, curators • patients, subjects, or populations. For their perspectives, and also to design studies that have actionable results for these groups
  • Data sharing plan I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar Share yours too! http://www.flickr.com/photos/myklroventine/892446624/
  • thank you: Dept of Biomedical Informatics at U of Pittsburgh; NLM for training grant funding; Open science online community and those who release their articles, datasets and photos openly; Dr Wendy Chapman for her support and feedback
  • “Does anyone want your data? That’s hard to predict[…] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.” Got data? Nature Neuroscience 10, 931 (2007)
  • Journal mandates variables
  • Correlates with self-reported data withholding [Bar chart, axis 0 to 3: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity] Blumenthal et al. Acad Med. 2006
  • Self-reported reasons for data withholding [Bar chart, axis 0% to 80%: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results] Campbell et al. JAMA 2002.
  • Prevalence of data withholding via surveys [Bar chart, axis 0% to 40%: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data] Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.