Thesis Proposal Piwowar Presentation 20091109
 


Presented at ASIS&T 2009 in the student awards section. The presentation contains an overview of my dissertation proposal, as 2009 winner of the Thomson Reuters Information Science Doctoral Dissertation Proposal Scholarship, administered by the ASIS&T Information Science Education Committee

    Presentation Transcript

    • Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data. Heather Piwowar, Department of Biomedical Informatics, University of Pittsburgh
    • Sharing research data http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
    • Sharing research data PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
    • http://www.flickr.com/photos/75166820@N00/5318468/
    • Shared data benefits science Verify Understand Extend Explore Combine Synergize Train Reduce
    • But... costly for authors Find Organize Document Deidentify Format Decide Ask Submit Answer questions Worry about mistakes being found Worry about data being misinterpreted Worry about being scooped Forgo money and IP and prestige???
    • As a result, policy makers have spent  lots of time and money .... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
    • ... on initiatives, requests, requirements, and tools: Funder data sharing requirements; Journal requirements and requests; Databases; Data sharing collaboration grids; Standards; Editorials, letters to the editor, discussion....
    • http://www.flickr.com/photos/mesh/14102209/
    • lots of data sharing! http://www.genome.jp/en/db_growth.html
    • but how much isn’t  shared? what isn’t shared? who isn’t sharing it? why not? how much does it matter? what can we do  about it?
    • you can not manage  what you do not measure http://www.flickr.com/photos/archeon/2941655917/
    • Related research Data usually collected via surveys and/or manual audits http://www.flickr.com/photos/jima/606588905/
    • Models of data and knowledge  sharing
    • Andriessen. Conditions for the willingness to share knowledge, 2006.
    • Harder. SMG WP 6/2008 .
    • Cabrera and Cabrera. Int J of HR Mgmt. 2005.
    • Kuo. JASIST. 2008.
    • Limitations of the related research • manual audits: small sample sizes • surveys: few variables + self-reporting bias • not much focus on measuring demonstrated behavior • not much focus on rewards • not much focus on policy • not much focus on biomedical data other than DNA sequences
    • Needed: a study of data sharing behaviour and impact that includes • a measurement of demonstrated behavior • policy variables • estimate of rewards • a broad and deep selection of data creation instances
    • Aim 1: Does sharing have benefit for those who share? Aim 2: Can sharing and withholding be systematically measured? Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
    • Scope of proposed study. Studies: published studies with English full text available in a centralized portal. Variables for examination: extracted from Medline and other sources
    • Microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/ File:DNA_double_helix_vertikal.PNG
    • http://farm3.static.flickr.com/2146/2389590651_9bbcc9d07e.jpg
    • Aim 1
    • Aim 1:  Does sharing have benefit  for those who share? http://www.flickr.com/photos/sunrise/35819369/
    • Aim 1:  Does sharing have benefit  for those who share? Benefit of value: Citations.
    • Aim 1: Does sharing have benefit for those who share? Dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). Citations: ISI Web of Science Citation index, citations from 2004-2005. Data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. Statistics: multivariate linear regression.
    • Aim 1:  Does sharing have benefit  for those who share?
    • Aim 1:  Does sharing have benefit  for those who share? Note the logarithmic scale
    • Aim 1:  Does sharing have benefit  for those who share? In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%) Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
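The percentage on this slide can be related to a regression coefficient if, as the logarithmic scale noted on the previous slide suggests, the regression outcome was the natural log of the citation count. That is an assumption here, and the function names below are illustrative; this sketch only shows the arithmetic and does not reproduce the published model.

```python
import math


def percent_change_from_log_coef(beta: float) -> float:
    """Percent change in the outcome implied by a coefficient on a ln-scale outcome."""
    return (math.exp(beta) - 1.0) * 100.0


def log_coef_from_percent_change(percent: float) -> float:
    """Inverse mapping: the ln-scale coefficient that corresponds to a percent change."""
    return math.log(1.0 + percent / 100.0)


# Values reported on the slide: 69% more citations, 95% CI 18% to 143%.
beta = log_coef_from_percent_change(69.0)
ci_low = log_coef_from_percent_change(18.0)
ci_high = log_coef_from_percent_change(143.0)

print(f"coefficient on ln(citations): {beta:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

Under this reading, the interval is symmetric around the coefficient on the log scale and becomes asymmetric (18% to 143%) only after converting back to percentages.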
    • Aim 1 conclusion:  data sharing has a  benefit for sharers
    • Next:  What factors predict sharing? http://www.flickr.com/photos/ryanr/142455033/
    • Next:  What factors predict sharing? Can I use the same methods of Aim 1 to choose studies and determine data sharing status? http://www.flickr.com/photos/ryanr/142455033/
    • Next: What factors predict sharing? Can I use the same methods of Aim 1 to choose studies and determine data sharing status? No, those methods don't scale to identify or classify enough datapoints. http://www.flickr.com/photos/ryanr/142455033/
    • Aim 2
    • Need automated methods to: Identify studies that generate datasets that could potentially be shared (Aim 2a) Determine which of these have in fact been shared (Aim 2b)
    • Aim 2a: Identify studies that create  gene expression microarray data http://www.flickr.com/photos/lofaesofa/248546821/
    • Aim 2a: Identify studies that create gene expression microarray data. Easy, via the MeSH indexing terms "gene expression profiling" and/or "microarray analysis"? Unfortunately, this approach has neither high recall nor high precision.
    • Aim 2a: Identify studies that create  gene expression microarray data Instead, look for wetlab methods in full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
    • Aim 2a: Identify studies that create  gene expression microarray data And query the full text through full-text query portals:
    • Aim 2a: Identify studies that create gene expression microarray data. Query development: use supervised natural language processing techniques on a corpus of Open Access articles. Query evaluation: 400 studies that created gene expression microarray data, as identified by Ochsner et al (2008). Goal: >90% precision, and sufficient recall to retrieve >1250 articles.
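The evaluation goal above is stated in terms of precision and recall. A minimal sketch of how those two numbers could be computed for a candidate query follows; the article IDs are made up for illustration, and precision is only meaningful if every retrieved article has been labeled as relevant or not.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) of a retrieved set against a labeled relevant set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical article IDs, for illustration only.
reference_standard = {"a1", "a2", "a3", "a4"}  # known to have created microarray data
query_hits = {"a1", "a2", "a3", "b9"}          # returned by a candidate full-text query

precision, recall = precision_recall(query_hits, reference_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```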
    • Aim 2b
    • Aim 2b: Identify studies that share  their expression microarray data http://www.flickr.com/photos/dcassaa/422261773/
    • Aim 2b: Identify studies that share  their expression microarray data
    • Aim 2b: Identify studies that share  their expression microarray data pmc_gds[filter] + text processing on ArrayExpress website Enough? Unbiased?
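One way the `pmc_gds[filter]` query on this slide could be issued programmatically is through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and makes no network call. Using `db=pmc`, the extra search term, and the helper name are assumptions for illustration, not details taken from the slides.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_gds_query_url(extra_terms: str = "", retmax: int = 20) -> str:
    """Build an ESearch URL combining optional terms with the pmc_gds filter."""
    term = "pmc_gds[filter]"
    if extra_terms:
        term = f"({extra_terms}) AND {term}"
    # Assumption: the filter is applied against the PubMed Central database.
    params = {"db": "pmc", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# "microarray" is a hypothetical extra term, used only to show the combination.
url = build_pmc_gds_query_url("microarray")
print(url)
```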
    • Aim 2b: Identify studies that share their expression microarray data. Reference standard: 200 of the 400 studies that created gene expression microarray data have shared their microarray data, as identified by Ochsner et al (2008). Goal: establish that the filter has >70% recall with an unbiased representation of MeSH terms, dataset size, and dataset species.
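The goal above has two parts: overall recall, and recall that does not differ much across subgroups. A rough sketch of checking both is shown here with invented records grouped by species; the record layout and function name are hypothetical, and a real check would also cover MeSH terms and dataset size.

```python
from collections import defaultdict


def recall_by_group(reference: list, found_ids: set, group_key: str) -> dict:
    """Recall of found_ids against the reference standard, split by one attribute."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in reference:
        group = record[group_key]
        totals[group] += 1
        if record["id"] in found_ids:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical reference-standard studies known to have shared their data.
reference = [
    {"id": "s1", "species": "human"},
    {"id": "s2", "species": "human"},
    {"id": "s3", "species": "mouse"},
    {"id": "s4", "species": "mouse"},
]
found_ids = {"s1", "s2", "s3"}  # studies the automated filter flagged as shared

overall_recall = sum(r["id"] in found_ids for r in reference) / len(reference)
per_species = recall_by_group(reference, found_ids, "species")
print(overall_recall, per_species)
```

A large gap between groups, as in this toy data, would suggest the filter is biased even if overall recall meets the target.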
    • Aim 3
    • Aim 3 – How often is data shared?  What predicts sharing?  How can we model sharing behavior? http://www.flickr.com/photos/ryanr/142455033/
    • Aim 3a:  Prevalence of data sharing
    • Aim 3a: Prevalence of data sharing
        PubMed ID   Portal   Created data?
        234         PMC      Yes
        345         HighPr   Yes
        456         Scirus   Yes
        567         PMC      Yes
        678         PMC      Yes
        789         HighPr   No
        890         PMC      No
        901         -        ?
    • Aim 3a: Prevalence of data sharing
        PubMed ID   Portal   Created data?
        234         PMC      Yes
        345         HighPr   Yes
        456         Scirus   Yes
        567         PMC      Yes
        678         PMC      Yes
    • Aim 3a: Prevalence of data sharing
        PubMed ID   Portal   Created data?   Shared data?
        234         PMC      Yes             Yes
        345         HighPr   Yes             Yes
        456         Scirus   Yes             Yes
        567         PMC      Yes             NO
        678         PMC      Yes             NO
    • Aim 3a: Prevalence of data sharing
        PubMed ID   Portal   Created data?   Shared data?
        234         PMC      Yes             Yes
        345         HighPr   Yes             Yes
        456         Scirus   Yes             Yes
        567         PMC      Yes             NO
        678         PMC      Yes             NO
        Prevalence = (Number with Shared data) / (Number with Created data)
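The prevalence ratio on this slide can be computed directly from the example rows. This is a minimal sketch using the slide's toy table; the tuple layout and function name are illustrative choices.

```python
# Toy rows copied from the slide: (PubMed ID, portal, created data?, shared data?)
rows = [
    (234, "PMC", True, True),
    (345, "HighPr", True, True),
    (456, "Scirus", True, True),
    (567, "PMC", True, False),
    (678, "PMC", True, False),
]


def prevalence(rows: list) -> float:
    """Number of studies with shared data divided by number with created data."""
    created = [r for r in rows if r[2]]
    if not created:
        raise ValueError("no studies with created data")
    shared = [r for r in created if r[3]]
    return len(shared) / len(created)


print(prevalence(rows))  # 3 of the 5 example studies shared their data
```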
    • Aim 3b:  Correlates with data sharing
    • Aim 3b: Correlates with data sharing
        PubMed ID   Portal   Created data?   Shared data?   Covariates
        234         PMC      Yes             Yes
        345         HighPr   Yes             Yes
        456         Scirus   Yes             Yes
        567         PMC      Yes             NO
        678         PMC      Yes             NO
    • Aim 3b: Correlates with data sharing. Features to include: • Does the journal have a data sharing policy? • Is the study funded by the NIH? • Is it subject to the NIH data sharing plan requirement? • Number of authors • Journal impact factor • Are the experimental samples from humans? • Disease of study • Year of publication • …
    • Aim 3b: Correlates with data sharing
        PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
        234         PMC      Yes             Yes            strong           yes          2
        345         HighPr   Yes             Yes            weak             yes          5
        456         Scirus   Yes             Yes            weak             no           6
        567         PMC      Yes             NO             strong           yes          5
        678         PMC      Yes             NO             strong           no           2
        (Journal policy, NIH funds?, # authors, ... are the covariates)
    • Aim 3b: Correlates with data sharing. Univariate odds ratios; multivariate logistic regression
    • Aim 3b: Correlates with data sharing
        PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
        234         PMC      Yes             Yes            strong           yes          2
        345         HighPr   Yes             Yes            weak             yes          5
        456         Scirus   Yes             Yes            weak             no           6
        567         PMC      Yes             NO             strong           yes          5
        678         PMC      Yes             NO             strong           no           2
        (Journal policy, NIH funds?, # authors, ... are the covariates)
        Diagram: Journal policy?, NIH funded?, # authors, ... → Shared data?
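A univariate odds ratio for one binary covariate can be read off a 2x2 table. The sketch below uses the NIH funding column from the slide's five example rows purely to show the calculation; five rows are far too few for inference, the field names are illustrative, and the multivariate logistic regression step is not shown.

```python
def odds_ratio(records: list, exposure: str, outcome: str) -> float:
    """Odds of the outcome among exposed records divided by odds among unexposed."""
    a = sum(1 for r in records if r[exposure] and r[outcome])          # exposed, shared
    b = sum(1 for r in records if r[exposure] and not r[outcome])      # exposed, not shared
    c = sum(1 for r in records if not r[exposure] and r[outcome])      # unexposed, shared
    d = sum(1 for r in records if not r[exposure] and not r[outcome])  # unexposed, not shared
    if b == 0 or c == 0:
        raise ZeroDivisionError("empty cell: odds ratio undefined without a correction")
    return (a * d) / (b * c)


# Example rows from the slide (NIH funds? and Shared data? columns).
records = [
    {"pmid": 234, "nih_funded": True, "shared": True},
    {"pmid": 345, "nih_funded": True, "shared": True},
    {"pmid": 456, "nih_funded": False, "shared": True},
    {"pmid": 567, "nih_funded": True, "shared": False},
    {"pmid": 678, "nih_funded": False, "shared": False},
]

print(odds_ratio(records, "nih_funded", "shared"))  # (2 * 1) / (1 * 1)
```

The journal policy column from the same toy table would hit the empty-cell case (both weak-policy rows shared), which is why the function refuses to divide by zero rather than returning infinity.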
    • Aim 3c: Model of data sharing
    • Aim 3c: Model of data sharing
        PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
        234         PMC      Yes             Yes            strong           yes          2
        345         HighPr   Yes             Yes            weak             yes          5
        456         Scirus   Yes             Yes            weak             no           6
        567         PMC      Yes             NO             strong           yes          5
        678         PMC      Yes             NO             strong           no           2
        (Journal policy, NIH funds?, # authors, ... are the covariates)
    • Aim 3c: Model of data sharing Exploratory factor analysis
    • Aim 3c: Model of data sharing
        PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
        234         PMC      Yes             Yes            strong           yes          2
        345         HighPr   Yes             Yes            weak             yes          5
        456         Scirus   Yes             Yes            weak             no           6
        567         PMC      Yes             NO             strong           yes          5
        678         PMC      Yes             NO             strong           no           2
        (Journal policy, NIH funds?, # authors, ... are the covariates)
        Diagram: Mandates, Amount of Collaboration, ... → Shared data?
    • Aim 3c: Model of data sharing
        PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
        234         PMC      Yes             Yes            strong           yes          2
        345         HighPr   Yes             Yes            weak             yes          5
        456         Scirus   Yes             Yes            weak             no           6
        567         PMC      Yes             NO             strong           yes          5
        678         PMC      Yes             NO             strong           no           2
        (Journal policy, NIH funds?, # authors, ... are the covariates)
        Diagram: Mandates, Amount of Collaboration, ... → Shared data? (links labeled Weak and Strong)
    • http://www.flickr.com/photos/donjuanna/322798429/
    • Limitations • Association does not imply causation • Important influences will be missed due to focus on measurable variables • Some derived variables involve many estimates and assumptions • Only considering public sharing in primary centralized databases • Only one datatype • Only research studies made available in full-text portals
    • Risks and contingency plans. NLP performance may be inadequate: supplement with manual annotation via Mechanical Turk. Author ambiguity may introduce extreme outliers: use Author-ity (Smalheiser and Torvik, 2005) for name disambiguation. Unable to derive a robust exploratory factor model: try other clustering techniques. Several variables may be unexpectedly difficult to extract and cross-reference: if not essential, defer analysis of that variable.
    • Current status Aim 1: Does sharing have benefit for those who share? Aim 2: Can sharing and withholding be systematically measured? Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior? Pilot completed. Now: full dataset collection
    • Anticipated contributions • Published assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing • Publicly available dataset associating microarray study publications with data sharing status • Generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals • Preliminary model of data sharing behaviour based on this large dataset
    • Future work • Identify and model data reuse • Citation analysis of the large cohort • Supplement with survey responses http://www.flickr.com/photos/cogdog/123072/
    • Data sharing plan I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar Share yours too! http://www.flickr.com/photos/myklroventine/892446624/
    • Thanks to: ➡ the NLM for funding training grant 5 T15 LM007059-22 ➡ the Dept of Biomedical Informatics at the U of Pittsburgh ➡ my committee: Dr Wendy Chapman (Biomed Informatics), Dr Ellen Detlefsen (iSchool), Dr Madhavi Ganapathiraju (Bioinformatics), Dr Brian Butler (Katz School of Business), Dr Gunther Eysenbach (U of Toronto, Health Policy Mgmt and Evaluation)
    • Funder Journal Investigator Institution Study Is research data shared after publication? aim
    • Prevalence of data withholding via surveys (bar chart, axis 0% to 40%). Bars: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors "not able to retrieve raw data"; not willing to release data. Sources: Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
    • Self-reported reasons for data withholding (bar chart, axis 0% to 80%). Bars: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. Source: Campbell et al. JAMA 2002.
    • Correlates with self-reported data withholding (chart, axis 0 to 3). Items: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. Source: Blumenthal et al. Acad Med. 2006