Thesis defense, Heather Piwowar, Sharing biomedical research data
 

Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for
measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)

    Presentation Transcript

    • Foundational studies for  measuring the impact,  prevalence, and patterns of  publicly sharing biomedical  research data Heather Piwowar Doctoral Defense March 24, 2010 Department of Biomedical Informatics University of Pittsburgh
    • Wendy Chapman, PhD Brian Butler, PhD Ellen Detlefsen, DLS Madhavi Ganapathiraju, PhD Gunther Eysenbach, MD, MPH
    • http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
    • http://www.flickr.com/photos/jsmjr/62443357/
    • http://www.flickr.com/photos/camilleharrington/3587294608/
    • http://www.flickr.com/photos/rkuhnau/3318245976/
    • http://www.flickr.com/photos/rkuhnau/3317418699/
    • http://www.flickr.com/photos/zemlinki/261617721/
    • http://www.flickr.com/photos/tracenmatt/3020786491/
    • http://www.flickr.com/photos/conformpdx/1796399674/
    • http://www.flickr.com/photos/the-o/2078239333/
    • lots of data sharing! http://www.genome.jp/en/db_growth.html
    • but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
    • Prior studies: surveys and/or manual audits Blumenthal et al. Acad Med. 2006 Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001. http://www.flickr.com/photos/jima/606588905/
    • Limitations of related work • small sample sizes • relatively few variables • self-reporting bias • not much focus on measuring demonstrated behavior • not much focus on rewards • not much focus on policy • not much focus on biomedical data other than DNA sequences
    • I believe analysis of the impact, prevalence, and patterns with which researchers share and withhold biomedical data can uncover rewards, best practices, and opportunities for increased adoption of data sharing. http://www.flickr.com/photos/archeon/2941655917/
    • Goal of this dissertation: Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.
    • Aim 1:  Does sharing have benefit for those  who share? Aim 2:  Can sharing and withholding be  systematically measured?  Aim 3:  How often is data shared?   What predicts sharing?   How can we model sharing behavior?
    • Scope: • raw research data • upon study publication • making data publicly available on the Internet • one datatype
    • microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
    • microarray data
    • Aim 1
    • Aim 1:  Does sharing have benefit  for those who share? http://www.flickr.com/photos/sunrise/35819369/
    • Aim 1:  Does sharing have benefit  for those who share? Benefit of value:  Citations. http://www.flickr.com/photos/sunrise/35819369/
    • Aim 1:  Does sharing have benefit  for those who share? dataset 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003) citations ISI Web of Science Citation index, citations from 2004-2005 data sharing locations Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine statistics Multivariate linear regression
    • Aim 1:  Does sharing have benefit  for those who share?
    • Aim 1:  Does sharing have benefit  for those who share? Note the logarithmic scale
    • Aim 1:  Does sharing have benefit  for those who share?
    • Aim 1:  Does sharing have benefit  for those who share? Conclusion:   data sharing is associated with an increase  in citation rate
    • Next: What factors predict sharing? http://www.flickr.com/photos/ryanr/142455033/
    • Can I use the same methods of Aim 1  to choose studies and determine data  sharing status?
    • Can I use the same methods of Aim 1  to choose studies and determine data  sharing status? No, those methods don’t scale to identify or  classify enough datapoints
    • Aim 2
    • Need automated methods to: Aim 2a: Identify studies that create datasets Aim 2b: Determine which of these have in fact been shared
    • Aim 2a: Identify studies that create  gene expression microarray data http://www.flickr.com/photos/lofaesofa/248546821/
    • Aim 2a: Identify studies that create  gene expression microarray data Easy, via MeSH indexing terms? gene expression profiling and/or microarray analysis Unfortunately, these have neither high  recall nor precision.
    • Aim 2a: Identify studies that create  gene expression microarray data Look for wetlab methods in full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
    • Query environment: Full-text portals query 85% of articles available through U of Pittsburgh library digital subscriptions.
    • Development set? Open access articles.
    • Features? Unigrams and bigrams from full text Training classifications? Automatic filter for whether publication had an associated dataset deposited in a database Feature selection and combination:
    • Derived query: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
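The derived query can be read as a boolean filter over an article's full text. The sketch below is one way to apply it locally, assuming lowercase matching on word boundaries and treating a trailing `*` as a prefix wildcard; the full-text portals used in the study may tokenize and match differently. The function names are illustrative and not taken from the slides.

```python
import re


def _has(text: str, term: str) -> bool:
    """Match a term on word boundaries; a trailing '*' means prefix match."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1])
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text) is not None


def matches_derived_query(full_text: str) -> bool:
    """Apply the derived query: required terms AND wetlab terms NOT exclusions."""
    text = full_text.lower()
    required = all(_has(text, t) for t in ("gene expression", "microarray", "cell", "rna"))
    wetlab = any(_has(text, t) for t in ("rneasy", "trizol", "real-time pcr"))
    excluded = any(_has(text, t) for t in ("tissue microarray*", "cpg island*"))
    return required and wetlab and not excluded
```

Word-boundary matching keeps "rna" from matching inside "journal" and "cell" from matching inside "excellent", which plain substring search would allow.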
    • Evaluation: Ochsner et al. Nature Methods (2008) vol. 5 (12) pp. 991 • 400 studies across 20 journals Precision: 90% (86% to 93%) Recall: 56% (52% to 61%)
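For reference, precision and recall follow directly from the counts of true positives, false positives, and false negatives in a manually classified evaluation set. The sketch below shows only the arithmetic; the counts in the usage note are made up for illustration and are not the study's data.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a retrieval query.

    precision = share of retrieved articles that really created the data
    recall    = share of data-creating articles that the query retrieved
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```

With hypothetical counts of 90 true positives, 10 false positives, and 60 false negatives, `precision_recall(90, 10, 60)` gives a precision of 0.9 and a recall of 0.6.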
    • Aim 2a: Identify studies that create  gene expression microarray data Conclusion:   We derived a query with high precision and  adequate recall to identify studies that  created microarray data
    • Aim 2b
    • Aim 2b: Identify studies that share  their expression microarray data http://www.flickr.com/photos/dcassaa/422261773/
    • Aim 2b: Identify studies that share  their expression microarray data
    • Aim 2b: Identify studies that share  their expression microarray data
    • Aim 2b: Identify studies that share  their expression microarray data Querying GEO and ArrayExpress for  PubMed IDs identified 77% of datasets  that were publicly available somewhere on  the internet.
    • Aim 2b: Identify studies that share  their expression microarray data
    • Aim 2b: Identify studies that share  their expression microarray data
    • Aim 2b: Identify studies that share  their expression microarray data Conclusion:   we have a method to find most gene  expression microarray datasets shared on  the internet, without much bias.
    • Aim 3
    • Aim 3 – How often is data shared?  What predicts sharing?  How can we model sharing behavior? Aim 2a  +  Aim 2b  +  lots of stats http://www.flickr.com/photos/cogdog/123072/
    • Funder Journal Investigator Institution Study Is research data shared after publication?
    • Is research data shared after publication? Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH? Journal: impact factor; strength of policy; open access?; number of microarray studies published. Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender. Institution: sector; size; impact rank; country. Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year.
    • journal rank
    • journal data sharing policy “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …” http://www.nature.com/authors/editorial_policies/availability.html http://www.nature.com/nature/journal/v453/n7197/index.html
    • institution rank Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
    • study type
    • author “experience” Author publication history: author name disambiguation via the Author-ity web service. Torvik & Smalheiser. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11. Citation counts:
    • author gender
    • funding level PubMed grant lists + NIH grant details
    • funder mandates Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
    • funder mandates Proxy for NIH data sharing policy applicability: If in any year since 2004, • funded by an NIH grant number with a “1” or “2” type code • received more than $750 000 in total funding from the grant
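The proxy rule above can be written as a small predicate. This is a minimal sketch under stated assumptions: the `GrantYear` record and its field names are hypothetical, "since 2004" is read as fiscal year 2004 or later, the type code is taken to be the first character of the grant number, and both conditions are required to hold for the same grant in the same year. The slides do not spell out these details.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GrantYear:
    """One year of funding for one grant (hypothetical record layout)."""
    grant_number: str         # leading character assumed to be the NIH type code
    fiscal_year: int
    total_funding_usd: float  # total funding received from the grant that year


def policy_proxy_applies(grant_years: Iterable[GrantYear]) -> bool:
    """True if any year since 2004 has a type-1 or type-2 grant above $750,000."""
    for g in grant_years:
        type_code = g.grant_number.strip()[:1]
        if (
            g.fiscal_year >= 2004
            and type_code in ("1", "2")
            and g.total_funding_usd > 750_000
        ):
            return True
    return False
```

The grant numbers in any test data would need to follow whatever format the grant source actually returns; the leading-digit convention is assumed here.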
    • and so on... 124 variables
    • stats Univariate proportions Factor analysis Logistic regression Second-order factor analysis More logistic regression
    • http://www.flickr.com/photos/blatzandchocolate/4281306244/
    • results 11,603 datapoints we found shared datasets for 25%
    • [Figure: Proportion of articles with shared datasets, by year. y-axis: proportion of articles with datasets found in GEO or ArrayExpress (0.05 to 0.35); x-axis: year article published (2000 to 2009).]
    • univariate analysis
    • [Figure: Proportion of datasets shared (0.0 to 1.0), by journal. Journals in the order listed on the slide: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun.]
    • [Figure: Proportion of datasets shared (0.0 to 1.0), by institution. Institutions in the order listed on the slide: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku.]
    • [Figure: Proportion of datasets shared (0.0 to 1.0), by institution rank (1 to 1901).]
    • multivariate analysis
    • factor analysis
    • logistic regression
    • [Figure: Multivariate nonlinear regressions with interactions. Odds ratio (axis from 0.25 to 8.00; the slide also carries the label 0.95) for each first-order factor: Has journal policy; Count of R01 & other NIH grants; Authors prev GEOAE sharing & OA & microarray creation; NO K funding or P funding; Journal impact; Institution high citations & collaboration; Journal policy consequences & long halflife; NOT animals or mice; Institution is government & NOT higher ed; Last author num prev pubs & first year pub; Large NIH grant; Humans & cancer; NO geo reuse + YES high institution output; First author num prev pubs & first year pub.]
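For readers less familiar with the odds-ratio axis: in a logistic regression, each coefficient is a log odds ratio, so exponentiating it gives the odds ratio plotted for that factor. The sketch below also computes a Wald-style interval, which is an assumption; the slides do not say how the plotted intervals were obtained.

```python
import math


def odds_ratio_with_ci(beta: float, std_err: float, z: float = 1.96) -> tuple[float, float, float]:
    """Convert a logistic regression coefficient to (odds ratio, lower, upper).

    beta    : fitted coefficient (log odds ratio per unit of the factor)
    std_err : standard error of the coefficient
    z       : normal quantile for the interval (1.96 for a 95% Wald interval)
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * std_err)
    upper = math.exp(beta + z * std_err)
    return odds_ratio, lower, upper
```

An odds ratio above 1.00 means the factor is associated with higher odds of sharing, and below 1.00 with lower odds. An interval that spans 1.00 is compatible with no association.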
    • second-order factor analysis
    • [Figure: second-order factor analysis of the first-order factors: Institution is government & NOT higher ed; NOT institution NCI or intramural; NO K funding or P funding; Journal policy consequences & long halflife; Authors prev GEOAE sharing & OA & microarray creation; Institution high citations & collaboration; NOT animals or mice; First author num prev pubs & first year pub; Humans & cancer; Count of R01 & other NIH grants; Large NIH grant; Has journal policy; NO geo reuse + YES high institution output; Last author num prev pubs & first year pub; Journal impact.]
    • logistic regression using second-order factors
    • [Figure: Multivariate nonlinear regression with interactions. Odds ratio (axis from 0.25 to 4.00; the slide also carries the label 0.95) for each second-order factor: OA journal & previous GEO-AE sharing; Amount of NIH funding; Journal impact factor and policy; Higher Ed in USA; Cancer & humans.]
    • size of effect: split at the medians of the factors
    • Overall: 25%
    • Open access/ previous sharing: 31% Less OA/prev sharing: 19% Overall: 25%
    • Open access/previous sharing: 31% Less OA/prev sharing: 19% Cancer/human: 18% Not cancer/human: 32% Overall: 25%
    • Open access/previous sharing: 31% (cancer/human: 24%, not cancer/human: 37%) Less OA/prev sharing: 19% (cancer/human: 13%, not cancer/human: 25%) Cancer/human: 18% Not cancer/human: 32% Overall: 25%
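The median-split breakdown can be sketched as follows: split each of two factor scores at its median, then compute the proportion of articles with shared data in each of the four resulting groups. The record layout and function name are illustrative assumptions, and scores equal to the median are placed in the "low" group here, a choice the slides do not specify.

```python
from statistics import median


def median_split_rates(records):
    """Sharing rate per quadrant after splitting two factors at their medians.

    records: list of (factor_a_score, factor_b_score, shared) tuples,
             where shared is True if a dataset was found for the article.
    Returns a dict keyed by (a_is_high, b_is_high) -> proportion shared.
    """
    med_a = median(r[0] for r in records)
    med_b = median(r[1] for r in records)
    counts = {}
    for a, b, shared in records:
        key = (a > med_a, b > med_b)  # ties at the median count as "low"
        n, k = counts.get(key, (0, 0))
        counts[key] = (n + 1, k + int(shared))
    return {key: k / n for key, (n, k) in counts.items()}
```

Quadrants with no articles are simply absent from the returned dict.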
    • Conclusions: • data sharing rates are increasing, but overall levels are low Preliminary evidence: • levels are particularly low in cancer • levels are highest for those who are publishing OA, have shared before
    • • data and filters were imperfect • many assumptions • didn’t capture all types of sharing • don’t know how generalizable across datatypes • should be considered hypothesis-generating http://www.flickr.com/photos/vlastula/300102949/
    • Goal of this dissertation: Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.
    • contribution • Aim 1 publication cited 45 times in Google Scholar, including by several editorials and books • Aim 2 methods reused in a neuroethics study at UBC • Aim 3 revealed evidence suggesting areas with high and low data sharing adoption for future study • data collection was mostly automated, using mostly free and open resources • dataset, collection code, analysis scripts to be made openly available upon publication of thesis
    • what’s next? http://www.flickr.com/photos/skrb/2427171774/
    • More data analysis Including: • Citation analysis of the 11,603 articles • Analysis with a focus on policy variables • Causality through structural equation modeling doi/10.1371/journal.pone.0008469.g002
    • Begin to investigate reuse http://www.flickr.com/photos/boitabulle/3668162701/
    • who reuses data? why? when? who doesn’t? which datasets are most likely to be  reused? how many datasets could be reused but  aren’t? why aren’t they? what can we do about it? what should we do about it?
    • Post‐doc of my dreams Postdoctoral Research Associate in the Sharing, Preservation, and Stewardship of Scientific Data Potential areas of focus include: • overcoming social and technological barriers to data deposition among scientists • the roles and interactions of individual scientists, journals/publishers, institutions, and the variety of disciplinary repositories • ... http://www.flickr.com/photos/gatewaystreets/3838452287/
    • Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it. Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields. The National Evolutionary Synthesis Center, NSF-funded: • Duke University, • UNC at Chapel Hill • North Carolina State University
    • Data sharing  is hard. I share my code and data at http://www.researchremix.org It is hard. Some is better than none. Be the change you want to see. http://www.flickr.com/photos/myklroventine/892446624/
    • Thanks to the Dept of Biomedical Informatics at the U of Pittsburgh, the NLM for funding through training grant 5 T15 LM007059, those who openly publish their data, source code, papers, photos, Dr. Wendy Chapman for her support and feedback, My family.
    • http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/