Visualization Tools for the
Refinery Platform
Nils Gehlenborg, PhD
HARVARD MEDICAL SCHOOL・CENTER FOR BIOMEDICAL INFORMATICS
SUPPORTING REPRODUCIBLE RESEARCH
WITH PROVENANCE VISUALIZATION
REPRODUCIBLE RESEARCH
Health & Science
The new scientific revolution: Reproducibility at last
By Joel Achenbach, January 27
Diederik Stapel, a professor of social psychology in the Netherlands, had been a rock-star scientist — regularly appearing on television and publishing in top journals. Among his striking discoveries was that people exposed to litter and abandoned objects are more likely to be bigoted.
And yet there was often something odd about Stapel’s research. When students asked to see the data behind his work, he couldn’t produce it readily. And colleagues would sometimes look at his data and think: It’s beautiful. Too beautiful. Most scientists have messy data, contradictory data, incomplete data, ambiguous data. This data was too good to be true.
In late 2011, Stapel admitted that he’d been fabricating data for many years.
The Stapel case was an outlier, an extreme example of scientific fraud. But this and several other high-profile cases of misconduct resonated in the scientific community because of a much broader, more pernicious problem: Too often, experimental results can’t be reproduced.
That doesn’t mean the results are fraudulent or even wrong. But in science, a result is supposed to be verifiable by a subsequent experiment. An irreproducible result is inherently squishy.
And so there’s a movement afoot, and building momentum rapidly. Roughly four centuries after the invention of the scientific method, the leaders of the scientific community are recalibrating their requirements, pushing for the sharing of data and greater experimental transparency.
Top-tier journals, such as Science and Nature, have announced new guidelines for the research they publish.
“We need to go back to basics,” said Ritu Dhand, the editorial director of the Nature group of journals. “We need to train our students over what is okay and what is not okay, and not assume that they know.”
The pharmaceutical companies are part of this movement. Big Pharma has massive amounts of money at stake and wants to see more rigorous pre-clinical results from outside laboratories. The academic laboratories act as
Raise standards for
preclinical cancer research
C. Glenn Begley and Lee M. Ellis propose how methods, publications and
incentives must change if patients are to benefit.
Efforts over the past decade to characterize the genetic alterations in human cancers have led to a better understanding of molecular drivers of this complex set of diseases. Although we in the cancer field hoped that this would lead to more effective drugs, historically, our ability to translate cancer research to clinical success has been remarkably low [1]. Sadly, clinical trials in oncology have the highest failure rate compared with other therapeutic areas. Given the high unmet need in oncology, it is understandable that barriers to clinical development may be lower than for other disease areas, and a larger number of drugs with suboptimal preclinical validation will enter oncology trials. However, this low success rate is not sustainable or acceptable, and investigators must reassess their approach to translating discovery research into greater clinical success and impact.
Many factors are responsible for the high failure rate, notwithstanding the inherently difficult nature of this disease. Certainly, the limitations of preclinical tools such as inadequate cancer-cell-line and mouse models [2] make it difficult for even
Many landmark findings in preclinical oncology research are not reproducible, in part because of inadequate cell lines and animal models.
29 MARCH 2012 | VOL 483 | NATURE | 531
COMMENT
© 2012 Macmillan Publishers Limited. All rights reserved
Believe it or not: how much can we rely on published data on potential drug targets?
Florian Prinz, Thomas Schlange and Khusru Asadullah
A recent report by Arrowsmith noted that the success rates for new development projects in Phase II trials have fallen from 28% to 18% in recent years, with insufficient efficacy being the most frequent reason for failure (Phase II failures: 2008–2010. Nature Rev. Drug Discov. 10, 328–329 (2011)) [1]. This indicates the limitations of the predictivity of disease models and also that the validity of the targets being investigated is frequently questionable, which is a crucial issue to address if success rates in clinical trials are to be improved.
Candidate drug targets in industry are derived from various sources, including in-house target identification campaigns, in-licensing and public sourcing, in particular based on reports published in the literature and presented at conferences. During the transfer of projects from an academic to a company setting, the focus changes from ‘interesting’ to ‘feasible/marketable’, and the financial costs of pursuing a full-blown drug discovery and development programme for a particular target could ultimately be hundreds of millions of Euros. Even in the earlier stages, investments in activities such as high-throughput screening programmes are substantial, and thus the validity of published data on potential targets is crucial for companies when deciding to start novel projects.
To mitigate some of the risks of such investments ultimately being wasted, most pharmaceutical companies run in-house target validation programmes. However, validation projects that were started in our company based on exciting published data have often resulted in disillusionment when key data could not be reproduced. Talking to scientists, both in academia and in industry, there seems to be a general impression that many results that are published are hard to reproduce. However, there is an imbalance between this apparently widespread impression and its public recognition (for example, see REFS 2,3), and the surprisingly few scientific publications dealing with this topic. Indeed, to our knowledge, so far there has been no published in-depth, systematic analysis that compares reproduced results with published results for wet-lab experiments related to target identification and validation.
Early research in the pharmaceutical industry, with a dedicated budget and scientists who mainly work on target validation to increase the confidence in a project, provides a unique opportunity to generate a broad data set on the reproducibility of published data. To substantiate our incidental observations that published reports are frequently not reproducible with quantitative data, we performed an analysis of our early (target identification and validation) in-house projects in our strategic research fields of oncology, women’s health and cardiovascular diseases that were performed over the past 4 years (FIG. 1a). We distributed a questionnaire to all involved scientists from target discovery, and queried names, main relevant published data (including citations), in-house data obtained and their relationship to the published data, the impact of the results obtained for the outcome of the projects, and the models
Figure 1 | Analysis of the reproducibility of published data in 67 in-house projects. a | This figure illustrates the distribution of projects within the oncology, women’s health and cardiovascular indications that were analysed in this study. b | Several approaches were used to reproduce the published data. Models were either exactly copied, adapted to internal needs (for example, using other cell lines than those published, other assays and so on) or the published data was transferred to models for another indication. ‘Not applicable’ refers to projects in which general hypotheses could not be verified. c | Relationship of published data to in-house data. The proportion of each of the following outcomes is shown: data were completely in line with published data; the main set was reproducible; some results (including the most relevant hypothesis) were reproducible; or the data showed inconsistencies that led to project termination. ‘Not applicable’ refers to projects that were almost exclusively based on in-house data, such as gene expression analysis. The number of projects and the percentage of projects within this study (a–c) are indicated. d | A comparison of model usage in the reproducible and irreproducible projects is shown. The respective numbers of projects and the percentages of the groups are indicated.
CORRESPONDENCE
NATURE REVIEWS | DRUG DISCOVERY www.nature.com/reviews/drugdisc
© 2011 Macmillan Publishers Limited. All rights reserved
1. Statistical issues
2. No access to data
3. No access to software
4. Insufficient description of experimental protocols
5. Insufficient description of data analysis process
…
CHALLENGES FOR REPRODUCIBILITY
N Gehlenborg et al., manuscript in preparation
REPRODUCIBLE AND INTEGRATIVE ANALYSIS
Refinery Platform: DATA REPOSITORY | ANALYSIS PIPELINES | VISUALIZATION TOOLS
DATA REPOSITORY
Meta Data: TREATMENT, CELL LINE, TIME POINT, …
DATA REPOSITORY
Experiment Graph: Source → Sample → Assay → Raw Data → Derived Data → Derived Data
Meta Data describes the Source, Sample and Assay nodes; Provenance (PROTOCOLS and ALGORITHMS) records how each data node was produced from its predecessors.
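The experiment graph above can be sketched as a small directed structure in which every node remembers the protocol or algorithm that produced it. This is a minimal illustration only, not Refinery's actual data model; all class, node and protocol names are made up:

```python
# Minimal sketch of an experiment graph: nodes are sources, samples,
# assays and data files; each parent link records the protocol or
# algorithm (the provenance) that produced the node.
# All names are illustrative, not Refinery's actual data model.

class Node:
    def __init__(self, name, kind):
        self.name = name          # e.g. "patient_1"
        self.kind = kind          # "source" | "sample" | "assay" | "raw" | "derived"
        self.parents = []         # list of (parent_node, protocol_or_algorithm)

    def derive(self, name, kind, via):
        """Create a child node, recording how it was produced."""
        child = Node(name, kind)
        child.parents.append((self, via))
        return child

def provenance(node):
    """Walk back to the root, yielding (ancestor name, step) pairs."""
    trail = []
    while node.parents:
        node, via = node.parents[0]
        trail.append((node.name, via))
    return trail

# Source -> Sample -> Assay -> Raw Data -> Derived Data
source  = Node("patient_1", "source")
sample  = source.derive("biopsy_1", "sample", via="sample collection protocol")
assay   = sample.derive("chipseq_1", "assay", via="ChIP protocol")
raw     = assay.derive("reads.fastq", "raw", via="sequencing")
derived = raw.derive("peaks.bed", "derived", via="peak-calling workflow")

for ancestor, via in provenance(derived):
    print(ancestor, "->", via)
```

Walking `provenance(derived)` recovers the full chain back to the source, which is exactly what a repository needs in order to re-execute or audit an analysis.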
ANALYSIS PIPELINES
Refinery drives a GALAXY instance through Galaxy's REST API. Galaxy contributes the Toolshed, the Workflow Editor and its Tools; each workflow declares Workflow Inputs and Workflow Outputs, which Refinery maps onto files in the data repository.
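As a rough sketch of that REST coupling: the endpoint paths below follow Galaxy's documented API (`/api/workflows`, `/api/workflows/{id}/invocations`), but the server URL, API key, workflow id and dataset id are all placeholders, and this is not Refinery's actual client code:

```python
# Sketch of driving a Galaxy instance over its REST API, as an external
# application such as Refinery does. Endpoint paths follow Galaxy's
# documented API; URL, key and ids below are placeholders.
import json
from urllib.request import Request

GALAXY_URL = "https://galaxy.example.org"   # placeholder instance
API_KEY = "YOUR-GALAXY-API-KEY"             # placeholder credential

def galaxy_request(path, payload=None):
    """Build an authenticated request against the Galaxy REST API."""
    url = f"{GALAXY_URL}/api/{path}"
    headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}
    data = json.dumps(payload).encode() if payload is not None else None
    return Request(url, data=data, headers=headers,
                   method="POST" if data else "GET")

# List available workflows (GET /api/workflows) ...
req = galaxy_request("workflows")

# ... or invoke one, mapping a workflow input step to a repository file
# (POST /api/workflows/{id}/invocations; "hda" = history dataset).
run = galaxy_request("workflows/abc123/invocations",
                     payload={"inputs": {"0": {"src": "hda", "id": "file42"}}})
```

The requests would be sent with `urllib.request.urlopen` (or any HTTP client); only the request construction is shown here so the sketch stays self-contained.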
ANALYSIS PIPELINES
Experiment Graph: Source → Sample → Assay → Raw Data → Derived Data
To run an analysis, existing data files in the experiment graph are selected as workflow inputs, the Galaxy workflow is executed, and the resulting Derived Data files are attached to the graph as new nodes, annotated with the WORKFLOW & PARAMETERS that produced them.
REPRODUCIBLE AND INTEGRATIVE ANALYSIS
Refinery Platform: DATA REPOSITORY | ANALYSIS PIPELINES | VISUALIZATION TOOLS
ISA-Tab is used to move data sets into and out of the repository.
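ISA-Tab study and assay files are plain tab-separated tables whose columns (Source Name, Protocol REF, Sample Name, Characteristics[...]) spell out the experiment graph row by row, which is what makes the format a natural import/export vehicle here. A minimal reading sketch; the tiny inline table is an invented example, not a real study:

```python
# Read a (toy) ISA-Tab study table: one row per source/sample pair,
# with the connecting protocol named in the Protocol REF column.
import csv
import io

STUDY = (
    "Source Name\tCharacteristics[cell line]\tProtocol REF\tSample Name\n"
    "patient_1\tHeLa\tsample collection\tbiopsy_1\n"
    "patient_2\tHeLa\tsample collection\tbiopsy_2\n"
)

def read_study(text):
    """Return one dict per row, keyed by the ISA-Tab column headers."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = read_study(STUDY)
# Each row yields one Source -> Sample edge of the experiment graph.
edges = [(r["Source Name"], r["Protocol REF"], r["Sample Name"]) for r in rows]
```

A full ISA-Tab archive also carries an investigation file and assay files; the same row-to-edge reading applies to those tables.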
VISUALIZATION TOOLS

REPRODUCIBLE AND INTEGRATIVE ANALYSIS
USE CASES
1. Collaboration between computational and experimental labs
2. Repository for large-scale, data-generating projects
3. Integration with existing repositories
VISUALIZATION OF PROVENANCE INFORMATION
Experiment Graph: Source → Sample → Assay → Raw Data → Derived Data → Derived Data
1. How do we represent provenance information?
2. How do we make provenance information actionable?
Stefan Luger, BSc, JOHANNES KEPLER UNIVERSITY LINZ
Marc Streit, PhD, JOHANNES KEPLER UNIVERSITY LINZ
VISUALIZATION OF PROVENANCE INFORMATION
breadth
depth
BASIC APPROACH
FILTERING:
BASED ON META DATA
DEALING WITH BREADTH:
LAYERING
CONTROLLING LEVEL OF DETAIL:
DEGREE OF INTEREST (DOI)
no lens fish-eye lens
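One classic formulation of a degree-of-interest function (in the Furnas fish-eye tradition) scores each node as a-priori interest plus user interest minus graph distance from the current focus, then hides nodes below a threshold. A small sketch under those assumptions; the graph, weights and threshold are illustrative, not the deck's actual DOI definition:

```python
# Degree-of-interest (DOI) filtering for a large provenance graph:
# DOI(node) = a-priori interest + user interest - distance from focus.
# Nodes below the threshold are collapsed or hidden.
from collections import deque

def distances(graph, focus):
    """BFS hop counts from the focus node (graph: adjacency dict)."""
    dist = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def doi_filter(graph, focus, api, ui, threshold):
    """Keep nodes whose DOI = api + ui - distance clears the threshold."""
    dist = distances(graph, focus)
    doi = {n: api.get(n, 0) + ui.get(n, 0) - d for n, d in dist.items()}
    return {n for n, score in doi.items() if score >= threshold}

# Tiny provenance chain: raw -- derived1 -- derived2
graph = {"raw": ["derived1"],
         "derived1": ["raw", "derived2"],
         "derived2": ["derived1"]}
keep = doi_filter(graph, focus="derived1",
                  api={"raw": 1}, ui={"derived1": 2}, threshold=0)
# "derived2" scores 0 + 0 - 1 = -1 and is filtered out of the view
```

Raising the threshold, or moving the focus as the user clicks through the graph, is what turns this into the fish-eye lens behavior on the previous slide.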
OUTLOOK:
MAKE PROVENANCE ACTIONABLE
N Gehlenborg et al. , manuscript in preparation
REPRODUCIBLE AND INTEGRATIVE ANALYSIS
HARVARD MEDICAL SCHOOL
JOHANNES KEPLER UNIVERSITY LINZ Stefan Luger
Samuel Gratzl
Holger Stitz
Marc Streit
HARVARD CHAN SCHOOL OF PUBLIC HEALTH
Funding
NIH/NHGRI K99 HG007583 & Harvard Stem Cell Institute
Ilya Sytchev
Shannan Ho Sui
Winston Hide
Acknowledgements
Richard Park
Psalm Haseley
Anton Xue
Peter J Park
Methods to Enhance the Reproducibility of
Precision Medicine
Pacific Symposium on Biocomputing
The Big Island of Hawaii
January 4-8, 2016
people.fas.harvard.edu/~manrai/
http://bit.ly/patient-driven http://bit.ly/psb16-reproducibility
WE ARE HIRING!
http://j.mp/refinery-developer-jr
GREAT CONFERENCES!