Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
Docker in Open Science Data Analysis Challenges by Bruce Hoff
1. Docker in Open Science
Data Analysis Challenges
Bruce Hoff
Principal Software Engineer,
Sage Bionetworks
2. Open Science in Disease Research
Containerization as a tool for scientific reproducibility
Case Study: Docker in the 2015 ALS Stratification Challenge
Case Study: Docker in the 2016 Digital Mammography Challenge
Open Issues and Lessons Learned
Agenda
3. This talk is about saving lives.
Disease research is data intensive…
… but published analyses often aren’t
reproducible.
… and valuable data sets aren’t shared freely.
… which reduces the rate of progress.
4. Difficulties in science validation
Amgen scientists tried to confirm 53 landmark papers in pre-clinical
oncology research: Only 6 (11%) were confirmed.[1]
Bayer HealthCare reported that only about 25% of published
preclinical studies could be validated.[2]
Potti Gate: Genomics research at Duke during 2006-2010 led to the
identification of diagnostic signatures that spurred clinical trials. The
research was later deemed statistically flawed and the clinical trials
were stopped.
[1] C. Glenn Begley and Lee M. Ellis, Nature 483, 531 (2012)
[2] F. Prinz, T. Schlange and K. Asadullah, Nature Rev. Drug Discov. 10, 712 (2011)
5. Our Solution: Open Data
Analysis Challenges
Engage the community, rather than a select company or
lab, to solve a problem in biological/medicinal research.
Obtain and expose a high value data set that would
otherwise be accessible by a few.
Require that participants share their code and document
their algorithms; test for reproducibility.
7. Measures of Impact
• 32 scientific challenges
• 50 partner institutions (since 2006)
• >5000 registered users
• 10 international conferences
• 2500 conference attendees
• >100 publications using DREAM data
• 25 journal articles
• 3 journal special issues
• 2 edited books
• 1,300 Citations
• 20 PhD theses
• Use of Challenges in Classroom as problem sets
8. Dialogue for Reverse Engineering Assessment and Methods
(DREAM) is a crowdsourcing effort that poses quantitative
challenges about systems biology modeling.
Sage Bionetworks (2009-) is a nonprofit biomedical research
organization seeking to accelerate biomedical research through
open systems, incentives, and standards.
The two organizations merged in 2013 to drive a continuing
series of open science challenges.
The Organization
9. • Web services that facilitate collaborative web science
– Projects Sharing Resources (code, files, ideas)
– wiki narratives
• Analysis provenance - linking data, code, and results; data
versioning
• Web services that facilitate Challenge logistics
– Registration, acceptance of data usage, acceptance of Challenge Terms and Conditions
– Real-time challenge leaderboards
– Discussion Forum
– Formation of Teams
– Online Supplement for Challenge Papers: e.g.:
https://www.synapse.org/#!Synapse:syn2528824/wiki/
Synapse: enabling collaborative research
11. ALS is a rapidly progressing neurodegenerative disease that typically leads
to death within 3-5 years but for which disease progression is heterogeneous
across the patient population.
Data for 9000 ALS patients provided by the Pooled Resources Open-Access
Clinical Trial (PRO-ACT) database.
The challenge was to predict disease progression from clinical data.
$28,000 in prize money raised through a grass-roots fund drive
https://www.indiegogo.com/projects/fund-the-prize-solve-als-together
Nature Biotechnology agreed to publish the results.
12. In a typical challenge…
• Data is partitioned into
– training
– leaderboard
– validation
• Participants
– download training data
– apply statistical learning methods
– submit predictions
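The three-way partition described above can be sketched as follows. This is only an illustration, not the organizers' actual procedure: the function name `partition_records`, the 60/20/20 proportions, and the fixed seed are all assumptions made for the example.

```python
import random


def partition_records(record_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split record IDs into training, leaderboard, and validation sets.

    The fractions and seed are illustrative assumptions, not values
    taken from any actual challenge.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = sorted(record_ids)            # fixed starting order for repeatability
    random.Random(seed).shuffle(ids)    # seeded shuffle, independent of global state
    n_train = round(len(ids) * fractions[0])
    n_board = round(len(ids) * fractions[1])
    return {
        "training": ids[:n_train],
        "leaderboard": ids[n_train:n_train + n_board],
        "validation": ids[n_train + n_board:],  # whatever remains
    }
```

Participants would only ever download the "training" partition; the other two stay with the organizers for scoring.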
13. Organizers want to constrain submitted models to work in a certain
way:
• Model has a ‘selector’ component to select predictive clinical features
• Model has a ‘predictor’ component to predict ALS outcome based on
selected features.
Organizers want to run each model themselves to:
- Ensure models are structured as prescribed
- Ensure reproducibility of output
Docker to the rescue!
[Diagram: Clinical Data -> Selector -> Selected Features -> Predictor -> Output; the Selector and Predictor together form the Model]
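One way to picture the prescribed selector/predictor structure is as a small harness that the organizers run themselves. The sketch below is hypothetical: `run_model`, the `Record` type, and the feature names in the usage note are invented for illustration and are not the challenge's actual interface.

```python
from typing import Callable, Dict, List

# A patient record maps clinical feature names to numeric values (assumed shape).
Record = Dict[str, float]


def run_model(
    selector: Callable[[List[Record]], List[str]],
    predictor: Callable[[Record], float],
    records: List[Record],
) -> List[float]:
    """Run a two-stage model: select features, then predict from only those."""
    selected = selector(records)
    predictions = []
    for record in records:
        # The predictor is handed only the selected features, so a model
        # cannot bypass the selector stage.
        restricted = {name: record[name] for name in selected}
        predictions.append(predictor(restricted))
    return predictions
```

For example, `run_model(lambda rs: ["fvc"], lambda r: 2 * r["fvc"], [{"age": 60.0, "fvc": 3.0}])` passes only the hypothetical `fvc` feature to the predictor.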
15. IBM Cloud with a ZEC12 system virtual
machine running a Linux server with 32
processors, 240 GB memory and 9 TB
storage space.
IBM Donates a Mainframe for ALS Challenge
16. Provision a container on a unique port for each participant. They log in as:
> ssh user_name@129.34.20.96 -p port_number
Provide a script that sends a “signal” to a process running Docker
> create_model_snapshot
Back-end process runs “docker commit” to create a copy of the model for
scoring.
Back-end reruns captured image as a new container, after mounting
leaderboard (or later, validation) data volume.
Using Docker with a Mainframe
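The snapshot-and-score flow on this slide can be sketched as Python functions that build the Docker command lines. Only `docker commit` and rerunning the image with a mounted data volume come from the talk; the function names, the `/data` mount point, and the read-only mount are assumptions for illustration.

```python
import subprocess
from typing import List


def commit_command(container: str, snapshot_image: str) -> List[str]:
    """Build the `docker commit` call that freezes a participant's container."""
    return ["docker", "commit", container, snapshot_image]


def scoring_command(snapshot_image: str, data_dir: str,
                    mount_point: str = "/data") -> List[str]:
    """Build the `docker run` call that reruns the snapshot with data mounted.

    The read-only flag (`:ro`) and the /data mount point are assumptions.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:{mount_point}:ro",
        snapshot_image,
    ]


def snapshot_and_score(container: str, snapshot_image: str, data_dir: str) -> None:
    """Run both steps in order; requires a host with the Docker CLI installed."""
    subprocess.run(commit_command(container, snapshot_image), check=True)
    subprocess.run(scoring_command(snapshot_image, data_dir), check=True)
```

Keeping command construction separate from execution lets the back end log or inspect each command before running it.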
18. • The Scientific Question: How can we reduce erroneous recall
rate (false positives)?
• Image analysis machine learning problem
• “Deep learning” algorithms expected
• $1.2M in prize money expected to attract 100s of serious
participants
• 600,000 mammography images donated (~20TB)
• Budget for 100s of GPU servers from two Cloud providers
(AWS, IBM)
19. Why use Docker?
1) Large data size
2) Sensitive data
3) Provisioned compute
20. (1) Allocate machine (e.g. own laptop)
(2) Retrieve base image
(3) Retrieve small, pilot dataset.
(4) Create model
(5) Verify model using pilot dataset
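Steps (2), (4), and (5) above might map onto Docker commands roughly as follows. This is a guess at the participant's side of the workflow, not the challenge's documented procedure: the function name, image names, directory layout, and mount point are all hypothetical.

```python
from typing import List


def pilot_workflow(base_image: str, model_tag: str,
                   build_dir: str, pilot_dir: str) -> List[List[str]]:
    """Return the Docker commands for a local dry run, in order.

    All arguments are placeholders supplied by the participant.
    """
    return [
        # (2) Retrieve the base image.
        ["docker", "pull", base_image],
        # (4) Create the model image from a Dockerfile in build_dir.
        ["docker", "build", "-t", model_tag, build_dir],
        # (5) Verify the model against the pilot dataset, mounted read-only
        #     at /data (an assumed mount point).
        ["docker", "run", "--rm", "-v", f"{pilot_dir}:/data:ro", model_tag],
    ]
```

Each inner list can be passed to `subprocess.run(..., check=True)` on a machine with Docker installed.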
24. • We’ve implemented the data donor’s wish to maintain control of
the data.
• We have obviated the need to download the large data set.
• We have democratized participation, making compute available
to those who might not otherwise have it.
• After the challenge we have a library of rerunnable models
ensuring reproducibility.
Outcome
25. • How best to monitor a fleet of Docker hosts (incl. GPU usage)?
• How reproducible are models run on different GPU machines?
How much of the software stack should be in the container?
• How shall we limit submitted jobs?
• Are there networking issues as models access data?
• What are the security issues when running submitted
containers?
Open questions
26. • Images aren't always portable. System Z images can't be used
on Intel-based hardware.
• Reproducibility doesn't mean comprehensibility
• Find out about all our challenges at www.synapse.org
• For those of you down in the trenches, see brucehoff/dockerauth
for an example of how to do registry delegated authorization in
Java.
/etc
27. Acknowledgements
Sage Bionetworks
Stephen Friend
Thea Norman
Lara Mangravite
Mike Kellen
Mette Peters
Arno Klein
Solly Sieberts
Abhi Pratap
Chris Bare
Bruce Hoff
IBM
Erhan Bilal
Kely Norel
Elise Blaese
Pablo Meyer Rojas
Kahn Rrhissorrakrai
EBI
Julio Saez Rodriguez
Thomas Cokelaer
Federica Eduati
Michael Menden
L. Maximilians University
Robert Kueffner
Univ Colorado, Denver
Jim Costello
OHSU
Joe Gray
Adam Margolin
Mehmet Gonen
Laura Heiser
Prize4Life
Melanie Leitner
Neta Zach
NCI
Dinah Singer
Dan Gallahan
ISMMS
Eli Stahl
Gaurav Pandey
Columbia University
Andrea Califano
Mukesh Bansal
Chuck Karan
Rice University
Amina Qutub
David Noren
Byron Long
MD Anderson
Steven Kornblau
Univ of Lausanne
Daniel Marbach
Broad Institute
Bill Hahn
Barbara Weir
Aviad Tsherniak
Merck
Robert Plenge
BYU
Keoni Kauwe
OICR
Paul Boutros
UCSC
Josh Stuart
29. • Science Translational Medicine (1 paper)
• Nature Biotechnology (4 papers)
• Nature Genetics (papers in preparation)
• Nature Methods (papers in preparation)
• Nature Neuroscience (papers in preparation)
• PLoS Computational Biology (papers in review and preparation)
• National Cancer Institute (contracts for Best Performers)
Challenge Assisted Peer Review Partners
30. A crowdsourcing effort that poses quantitative challenges about systems
biology modeling and data analysis on:
Transcriptional and signaling networks,
Predictions of response to perturbations,
Translational research (tox, RA, AD, ALS, AML, …)
Our mission is
to contribute to the solution of important biomedical problems
to foster collaboration between research groups
to democratize data
to accelerate research
to objectively assess algorithm performance
What are the DREAM Challenges
31. Peer review is subjective. But even if it were not, what comes to the
reviewers may be biased:
Bias against publication of negative results or results contrary to
published results
The incentive structure puts researchers under considerable pressure to try
until they find a positive result (multiple testing, over-fitting, etc.)
Dani Brunner et al., Behavioral
processes 89, 187-195 (2012)
Inflated Statistical Significance
Multiple Testing
Selective Reporting
Overfitting
32. Benefits of crowd-sourcing
• Performance Evaluation
– Unbiased, consistent, and rigorous method assessment
– Unbiased comparison and discovery of best methods
– Determine the solvability of a scientific question
• Sampling of the space of methods
– Understand the diversity of methodologies presently being
used to solve a problem
33. Benefits of crowd-sourcing
• Acceleration of Research
– The community of participants can do in 4 months what would take any
one group 10 years
• Community Building
– Make high quality, well-annotated data accessible
– Foster community collaborations on fundamental research questions
– Determine robust solutions through community consensus: “The Wisdom
of Crowds”
34. • Disease research is data intensive. A typical researcher has a PhD in
multivariate statistics and does a lot of programming in languages like R,
Python, and Matlab, using libraries of established tools.
• So these analyses are software stacks of a sort, each piece having the
typical series of revisions.
• This makes reproducibility really challenging: To reproduce an analysis
you need not only the original data and the statistical processing script
written by the author, but the correct versions of all the dependencies.
• Obviously containerization offers a powerful tool for reproducibility: the
entire software stack used in an analysis can be tracked.
The challenge of reproducibility
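To make the dependency problem concrete, the sketch below records the installed version of each Python package an analysis relies on. It is a minimal illustration with a hypothetical function name, and it covers only Python packages; a container image also fixes the interpreter, system libraries, and non-Python tools.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def snapshot_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than failing the whole snapshot.
            versions[name] = None
    return versions
```

Saving such a snapshot alongside the results gives a later reader at least the version list needed to rebuild the environment.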
Editor's Notes
Think in terms of mining data sets incorporating complete genomic profiles from thousands of subjects.
Today someone working in disease research may have a PhD in statistics and never see a wet lab.
Synapse provides a layer of web services that allow researchers to easily record and collaborate on their research (as widely or narrowly as they choose) in real-time and across institutional boundaries. These services include not only the Synapse web portal, but also programmatic clients which talk to the same web services.
By leveraging Synapse provenance services, analysts are able to provide an analysis trail of the data, code, and results associated with a research project. This helps all involved in the project to clearly see what has been done, and by whom.
By operating the Synapse platform and its services free of charge as a service to the scientific community, Sage Bionetworks hopes to catalyze new collaborations as well as exciting and reproducible scientific discoveries.
Then maybe just mention that Brian Bot, Chris Bare, and Thea Norman are all at the meeting and would be happy to talk to anyone interested — and that they can stop by our poster.
For reproduced findings, authors had paid close attention to controls, reagents, investigator bias and describing the complete data set.
For non-reproduced findings, data were not routinely analyzed by investigators blinded to the experimental versus control groups, there are no guidelines to report all data, etc.
In the Bayer study 70% of the studies analyzed were on cancer research.