SlideShare a Scribd company logo
PACKAGING COMPUTATIONAL
BIOLOGY TOOLS FOR BROAD
DISTRIBUTION AND EASE-OF-
REUSE
Matthew Vaughn @mattdotvaughn
Director, Life Sciences Computing
Texas Advanced Computing Center
1
TACC AT A GLANCE
7/11/2016 2
Personnel
160 Full time staff (~70 PhD)
Facilities
10 MW Data center capacity
Redundant facilities for storage & hosting
Systems and Services
A Billion compute hours per year
5 Billion files, 50 Petabytes of Data,
Hundreds of Public Datasets
Capacity & Services
HPC, HTC, Visualization, Large scale
data storage, Cloud computing
Consulting, Curation and analysis, Code
optimization, Portals and Gateways, Web
service APIs, Training and Outreach
SCIENTIFIC COMPUTING: EARLY DAYS
7/11/2016 3
• C/C++/FORTRAN/PERL/SHELL
• MPI
• LAPACK/BLAS/PETSC
• SGE
• UNIX
• X86/PPC/SPARC
SCIENTIFIC COMPUTING: NOW
7/11/2016 4
LANGUAGES
• Python 2 & 3
• R
• Julia
• Perl
• Matlab
• Java
• Scala, Clojure, etc
• .NET
• C/C++
• Swift
• Haskell
• Go
• Javascript
FRAMEWORKS
• MapReduce Hadoop, Storm,
Pachyderm, Cloudera
• Event & Streaming: Kinesis,
Azure Stream Analytics, Camel,
Streambase
• Deep/Machine Learning: Watson,
Azure BI, Tensorflow, Caffe
• In-memory parsing: Kognito,
Apache Spark
• Containers: Docker, Rocket,
MESOS, Kubernetes
• Cloud: AWS, GCE, OpenStack,
vCloud, Azure
HARDWARE
•Many-core computing - 50-100
threads/node*
• Xeon / Xeon Phi
• GPU
• OpenPower
• ARM
• ShenWei
• Multi-level memory architecture
• Hierarchical storage
• FPGAs
• Quantum-like systems
NOT JUST ONE DATA TSUNAMI BUT THOUSANDS OF THEM
7/11/2016 5
EMERGENCE OF CLOUD TECHNOLOGY AND BUSINESS MODELS
7/11/2016 6
Mike
• Computing novice
• Works remotely at partner
site
Eliza
• Masters specific
computing analysis
skills
• Readily adopts new
technology
Roshan
• Computationally
experienced
• Focused on interpretation
Paulo
• Staff computational
expert
• Supports multiple
projects
DEMOCRATIZED COMPUTING
7/11/2016 7
Nikolaidas Group
• Mostly
experimentalists
• Strict data sharing
& access rules
THEIR NEEDS (30,000 FT VIEW)
 Store, organize, share measured data
 Do (and re-do) perform 1’ analyses
 Store, organize, share derived data results
 Invent and explore hypotheses
 Share analysis code with the scientific public
 Integrate results from new experiments
 Publish data and plots, visualizations and analysis
tools
7/11/2016 8
THEIR NEEDS (500 FT VIEW)
 Data lifecycle management
 Fine-grained permissions and roles
 Discoverability
 Version control
 Domesticating promising new analysis codes
 Wrestling with immature technology
 Making their science internally reproducible
 Adopting efficient analytical methods
7/11/2016 9
HOW DO WE HELP RESEARCHERS WITH
DIVERSE NEEDS AND BACKGROUNDS KEEP
UP WITH ADVANCES IN COMPUTATIONAL
BIOLOGY?
7/11/2016 10
ESSENTIAL DRIVERS
7/11/2016 11
• I need to customize my research
environment
• I need to share my research
environment with others
• I need to teach others how to use
my research environment
CUSTOMIZING THE HPC ENVIRONMENT
7/11/2016 12
INSTALL PACKAGE X, VERSION 1.0.1 ON RESOURCE Y?
TACC Life Sciences Strategy
• Maintain public Github repo of build
instructions for RPMs
• Contributions accepted via pull request or
direct from trusted partner
• TACC staff build, test, install RPMs
• RPMs are relocatable, stored for re-use in
future
• Software is discoverable via modules
Similar concept to Rocks Rolls. See also…
BioLinux.
TACC LSC spec repository
PROS AND CONS
7/11/2016 13
PROS
• Direct path for community contribution
• Digital footprint is small: Spec files + RPM artifacts
• Build instructions are discoverable, version controlled, attributable
• Resulting digital artifacts are reusable on the host system
CONS
• Learning curve is steep. Need to master RPM build system plus host-specific
build systems
• TACC staff still have to build/test/deploy
• SPEC files have dialects. Impedes portability between systems.
• RPMs only work on target system.
VERDICT: NOT HELPFUL AT ALL FOR TRAINING. DOES IMPROVE THE WORK ENVIRONMENT.
SHAREABLE VM IMAGES AND DATA
7/11/2016 14
OpenStack + Science-oriented GUI
• Cyverse Atmosphere (UA/TACC)
• XSEDE Jetstream (IU/TACC)
• NSF Chameleon (UC/TACC)
• Users extend and share VM images
configured with specific software
• Users can share data volumes
• VMs can be set up multi-user
• VMs can be exported and published
• 95% self-service!
Screenshot from Jetstream, early user period
PROVIDE A COMPLETE COMPUTER THAT CAN RUN PACKAGE X, VERSION 1.0.1
PROS AND CONS
7/11/2016 15
PROS
• Entirely self-service VM image publishing
• Versioned VM images plus extra metadata is good for reproducible
analysis
• Learning curve for USING the cloud system is shallow
• Learning curve for building and sharing VMs is modest
• Multiple operating systems are supported
CONS
• Portability between cloud providers is limited
• Orchestrating 2+ VMs is very difficult
• Digital footprint is the application plus OS, per version
• Requires either purchase of commercial cloud capacity or sustained
investment in private cloud (OS/Euc/Nebula)
VERDICT: QUITE HELPFUL FOR TRAINING OR SOFTWARE SHARING, BUT…
CONTAINER TECHNOLOGIES
7/11/2016 16
DockerHub is one public repository
for images
Cyverse and other TACC projects use
Docker
• Package software + dependencies
+ essential configuration
• Based on a file (Dockerfile, for
instance) that can be shared and
maintained under version control
• Synergistic with VM and web
service approaches
• Unites deployment strategy for a
platform and its hosted software
Exemplar Dockerfile for teaching
the basics of containerization
PROVIDE ONLY THE FILES & DEPENDENCIES FOR PACKAGE X, VERSION 1.0.1
PROS AND CONS
7/11/2016 17
PROS
• Minimal storage footprint – just the deltas beyond base operating system
footprint
• Instructions for building images are plaintext files in a DSL
• Images can be made FAIR via standards and public repositories
• Students eagerly adopt docker run x:1.0.1 command
CONS
• Usage requires reasonable understanding of key Linux technologies
• Images themselves lack documentation and validation
• Software must be based on Linux kernel 3.10+
• Not optimized for large or persistent data
• Most public-funded infrastructure does not support containers
VERDICT: HELPFUL FOR SHARING AND TEACHING, IF USED & RESOURCED PROPERLY
WEB SERVICE APIS
7/11/2016 18
TACC/Cyverse Agave API
• Package scientific applications as
nodes in a workflow
• Model 1' and 2' data + host
resources as dependencies
• Handle validation+invocation, data
marshaling+management, resource
brokering+coordination,
identity+access for the user
• Use either GUIs, language libraries,
or direct service calls for access
DNA Subway runs real-scale NGS tools on TACC & AWS
PROVIDE ABSTRACT ABILITY TO RUN PACKAGE X, VERSION 1.0.1 ON RESOURCE Y
PROS AND CONS
7/11/2016 19
PROS
• Digital footprint is small: Application + dependencies only
• Compute and storage protocols are abstracted from end users
• Provenance is captured and maintained
CONS
• Requires non-trivial investment by tool developers
• Direct usage (not via portal) is challenging for end users
• Expectation management and user support for end users is
challenging
• Requires sustained investment in maintaining the services platform in
addition to its resources
VERDICT: HELPFUL TO ENABLE USE OF REAL-SCALE TOOLS & DATA
JUPYTER NOTEBOOKS
7/11/2016 20
TACC provides Jupyter Notebook support on its
HPC systems and hosted JupyterHub in its API
platform
• Narrative, rich-text interface with embedded
media
• In-line code execution for key languages
such as Python, Julia, Go, Haskell, R, Torch
• Interactive, exploratory computing
• Notebooks themselves are self-contained
sharable documents
Part of a larger strategy to support interactive
mode computing: Rstudio, Mathematica, SAGE,
and Matlab
Loading the Iris Data Set, used for teaching statistical
classification techniques, into a Jupyter notebook
DOCUMENT ALL THE STEPS AROUND RUNNING PACKAGE X, VERSION 1.0.1
PROS AND CONS
7/11/2016 21
PROS
• Digital footprint is small: Notebooks are text docs
• Narrative support plus interactive widgets are first-class aspects
• Closest analog yet to team programming (but async!)
CONS
• Notebook infrastructure itself needs resourcing
• Notebooks don’t natively provide the software or data dependencies
for a given workflow
• They also don’t scale (natively)
• Need to train end users in how to use Notebooks + the specific
bioinformatics skills
VERDICT: FANTASTIC PLATFORM TO AUTHOR INTERACTIVE TUTORIALS & GUIDES, BUT…
NO SILVER BULLETS
7/11/2016 22
 Installable RPMs aren’t that easy to build
 VM s aren’t always portable or scalable
 Jupyter Notebooks need pre-configured hosts
 Containers need support for persistent data
 The cloud is an abstraction for “someone else’s
problem”
THESE TOOLS COMPLEMENT ONE ANOTHER
Bioinformatics training
ENSEMBLES VS ONE-MAN BANDS
7/11/2016 23
AKES01: Clouds, Clusters, and Containers: Tools for responsible,
collaborative computing
• JupyterHub hosted on XSEDE Jetstream
• Multiuser Docker host hosted on XSEDE Jetstream
• Web shell for access even from tablets
• Jupyter notebooks pre-configured to interact with Agave APIs
• Storage on Cyverse Data Store cloud
• Application registry @ Cyverse App catalog
• Using DockerHub & Github commercial resources
Skills focus
• Containerization 101 + Web services 101
• Sharing and publishing data, code, and results
• Provenance and other metadata
John Fonner from TACC
demonstrating file provenance within
Cyverse
QUESTIONS AND DISCUSSIONS
7/11/2016 24

More Related Content

What's hot

ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
OpenNebula Project
 

What's hot (20)

Delivering Infrastructure-as-a-Service with Open Source Software
Delivering Infrastructure-as-a-Service with Open Source SoftwareDelivering Infrastructure-as-a-Service with Open Source Software
Delivering Infrastructure-as-a-Service with Open Source Software
 
Discover the all new Mesosphere DC/OS 1.10
Discover the all new Mesosphere DC/OS 1.10Discover the all new Mesosphere DC/OS 1.10
Discover the all new Mesosphere DC/OS 1.10
 
HPC at NIBR
HPC at NIBRHPC at NIBR
HPC at NIBR
 
DuraCloud Integration with SDSC Presentation Slides
DuraCloud Integration with SDSC Presentation SlidesDuraCloud Integration with SDSC Presentation Slides
DuraCloud Integration with SDSC Presentation Slides
 
Containerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deploymentContainerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deployment
 
KarthikSNOW_CV
KarthikSNOW_CVKarthikSNOW_CV
KarthikSNOW_CV
 
ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
ISC Cloud'13 - Hands-On Tutorial on “Building Your Cloud for HPC, Here & Now,...
 
Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011
 
Jelastic Features 2.x
Jelastic Features 2.xJelastic Features 2.x
Jelastic Features 2.x
 
Designing and Implementing a cloud-hosted SaaS for data movement and Sharing ...
Designing and Implementing a cloud-hosted SaaS for data movement and Sharing ...Designing and Implementing a cloud-hosted SaaS for data movement and Sharing ...
Designing and Implementing a cloud-hosted SaaS for data movement and Sharing ...
 
Hadoop Everywhere & Cloudbreak
Hadoop Everywhere & CloudbreakHadoop Everywhere & Cloudbreak
Hadoop Everywhere & Cloudbreak
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Presentation linux on power
Presentation   linux on powerPresentation   linux on power
Presentation linux on power
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop EnvironmentHow to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
 
Cloud Computing Architecture with Open Nebula - HPC Cloud Use Cases - NASA A...
Cloud Computing Architecture with Open Nebula  - HPC Cloud Use Cases - NASA A...Cloud Computing Architecture with Open Nebula  - HPC Cloud Use Cases - NASA A...
Cloud Computing Architecture with Open Nebula - HPC Cloud Use Cases - NASA A...
 
Enabling Fast IT using Containers, Microservices and DevOps Model
Enabling Fast IT using Containers, Microservices and DevOps ModelEnabling Fast IT using Containers, Microservices and DevOps Model
Enabling Fast IT using Containers, Microservices and DevOps Model
 

Viewers also liked

What's next? La Fine dell'Era del Buon Senso
What's next? La Fine dell'Era del Buon SensoWhat's next? La Fine dell'Era del Buon Senso
What's next? La Fine dell'Era del Buon Senso
Stefano Maruzzi
 
Newhouse - APTA 2016 - Slides with notes
Newhouse - APTA 2016 - Slides with notesNewhouse - APTA 2016 - Slides with notes
Newhouse - APTA 2016 - Slides with notes
Stephen Newhouse
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
Matt Massie
 
RNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTieRNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTie
Gunnar Rätsch
 

Viewers also liked (20)

What's next? La Fine dell'Era del Buon Senso
What's next? La Fine dell'Era del Buon SensoWhat's next? La Fine dell'Era del Buon Senso
What's next? La Fine dell'Era del Buon Senso
 
Beyond Scrum: Scaling Agile with Continuous Delivery and Subversion
Beyond Scrum: Scaling Agile with Continuous Delivery and SubversionBeyond Scrum: Scaling Agile with Continuous Delivery and Subversion
Beyond Scrum: Scaling Agile with Continuous Delivery and Subversion
 
Newhouse - APTA 2016 - Slides with notes
Newhouse - APTA 2016 - Slides with notesNewhouse - APTA 2016 - Slides with notes
Newhouse - APTA 2016 - Slides with notes
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Next-Generation Sequencing and its Applications in RNA-Seq
Next-Generation Sequencing and its Applications in RNA-SeqNext-Generation Sequencing and its Applications in RNA-Seq
Next-Generation Sequencing and its Applications in RNA-Seq
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities
 
Docker @ FOSS4G 2016, Bonn
Docker @ FOSS4G 2016, BonnDocker @ FOSS4G 2016, Bonn
Docker @ FOSS4G 2016, Bonn
 
RNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTieRNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTie
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods
 
A next Generation Sequencing Approach to Detect Large Rearrangements in BRCA1...
A next Generation Sequencing Approach to Detect Large Rearrangements in BRCA1...A next Generation Sequencing Approach to Detect Large Rearrangements in BRCA1...
A next Generation Sequencing Approach to Detect Large Rearrangements in BRCA1...
 
Ngs introduction
Ngs introductionNgs introduction
Ngs introduction
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Reproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and NextflowReproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and Nextflow
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysis
 
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 

Similar to Packaging computational biology tools for broad distribution and ease-of-reuse

Red hat ceph storage customer presentation
Red hat ceph storage customer presentationRed hat ceph storage customer presentation
Red hat ceph storage customer presentation
Rodrigo Missiaggia
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omics
David Wallom
 

Similar to Packaging computational biology tools for broad distribution and ease-of-reuse (20)

Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...
 
Scaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data ChallengesScaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data Challenges
 
e-infrastructural needs to support informatics
e-infrastructural needs to support informaticse-infrastructural needs to support informatics
e-infrastructural needs to support informatics
 
Red hat ceph storage customer presentation
Red hat ceph storage customer presentationRed hat ceph storage customer presentation
Red hat ceph storage customer presentation
 
Desktop as a Service supporting Environmental 'Omics
Desktop as a Service supporting Environmental 'OmicsDesktop as a Service supporting Environmental 'Omics
Desktop as a Service supporting Environmental 'Omics
 
Peanut Butter and jelly: Mapping the deep Integration between Ceph and OpenStack
Peanut Butter and jelly: Mapping the deep Integration between Ceph and OpenStackPeanut Butter and jelly: Mapping the deep Integration between Ceph and OpenStack
Peanut Butter and jelly: Mapping the deep Integration between Ceph and OpenStack
 
Choosing PaaS: Cisco and Open Source Options: an overview
Choosing PaaS:  Cisco and Open Source Options: an overviewChoosing PaaS:  Cisco and Open Source Options: an overview
Choosing PaaS: Cisco and Open Source Options: an overview
 
Ceph, Open Source, and the Path to Ubiquity in Storage - AACS Meetup 2014
Ceph, Open Source, and the Path to Ubiquity in Storage - AACS Meetup 2014Ceph, Open Source, and the Path to Ubiquity in Storage - AACS Meetup 2014
Ceph, Open Source, and the Path to Ubiquity in Storage - AACS Meetup 2014
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
OpenStack Icehouse Overview
OpenStack Icehouse OverviewOpenStack Icehouse Overview
OpenStack Icehouse Overview
 
Prashant Vichare Resume
Prashant Vichare ResumePrashant Vichare Resume
Prashant Vichare Resume
 
The Next Generation Datacenter
The Next Generation DatacenterThe Next Generation Datacenter
The Next Generation Datacenter
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 
Open Marketing Meeting 03/27/2013
Open Marketing Meeting 03/27/2013Open Marketing Meeting 03/27/2013
Open Marketing Meeting 03/27/2013
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omics
 
Orchestrating stateful applications with PKS and Portworx
Orchestrating stateful applications with PKS and PortworxOrchestrating stateful applications with PKS and Portworx
Orchestrating stateful applications with PKS and Portworx
 
Orchestrating Stateful Applications with PKS and Portworx
Orchestrating Stateful Applications with PKS and PortworxOrchestrating Stateful Applications with PKS and Portworx
Orchestrating Stateful Applications with PKS and Portworx
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 

More from Matthew Vaughn

How Cyverse.org enables scalable data discoverability and re-use
How Cyverse.org enables scalable data discoverability and re-useHow Cyverse.org enables scalable data discoverability and re-use
How Cyverse.org enables scalable data discoverability and re-use
Matthew Vaughn
 

More from Matthew Vaughn (15)

On-Demand Cloud Computing for Life Sciences Research and Education
On-Demand Cloud Computing for Life Sciences Research and EducationOn-Demand Cloud Computing for Life Sciences Research and Education
On-Demand Cloud Computing for Life Sciences Research and Education
 
Towards a (united) federation of Bioinformatics resources
Towards a (united) federation of Bioinformatics resourcesTowards a (united) federation of Bioinformatics resources
Towards a (united) federation of Bioinformatics resources
 
CYVERSE: TRANSFORMING LIFE SCIENCE RESEARCH VIA CYBERINFRASTRUCTURE
CYVERSE: TRANSFORMING LIFE SCIENCE RESEARCH VIA CYBERINFRASTRUCTURECYVERSE: TRANSFORMING LIFE SCIENCE RESEARCH VIA CYBERINFRASTRUCTURE
CYVERSE: TRANSFORMING LIFE SCIENCE RESEARCH VIA CYBERINFRASTRUCTURE
 
Jetstream: Accessible cloud computing for the national science and engineerin...
Jetstream: Accessible cloud computing for the national science and engineerin...Jetstream: Accessible cloud computing for the national science and engineerin...
Jetstream: Accessible cloud computing for the national science and engineerin...
 
How Cyverse.org enables scalable data discoverability and re-use
How Cyverse.org enables scalable data discoverability and re-useHow Cyverse.org enables scalable data discoverability and re-use
How Cyverse.org enables scalable data discoverability and re-use
 
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingClouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
 
Jetstream: Adding Cloud-based Computing to the National Cyberinfrastructure
Jetstream: Adding Cloud-based Computing to the National CyberinfrastructureJetstream: Adding Cloud-based Computing to the National Cyberinfrastructure
Jetstream: Adding Cloud-based Computing to the National Cyberinfrastructure
 
Arabidopsis Information Portal: A Community-Extensible Platform for Open Data
Arabidopsis Information Portal: A Community-Extensible Platform for Open DataArabidopsis Information Portal: A Community-Extensible Platform for Open Data
Arabidopsis Information Portal: A Community-Extensible Platform for Open Data
 
Developing Apps: Exposing Your Data Through Araport
Developing Apps: Exposing Your Data Through AraportDeveloping Apps: Exposing Your Data Through Araport
Developing Apps: Exposing Your Data Through Araport
 
Dinosaur bioinformatics
Dinosaur bioinformaticsDinosaur bioinformatics
Dinosaur bioinformatics
 
aip-developer-intro_pag2015
aip-developer-intro_pag2015aip-developer-intro_pag2015
aip-developer-intro_pag2015
 
iplant-highlights-pag2015
iplant-highlights-pag2015iplant-highlights-pag2015
iplant-highlights-pag2015
 
aip-workshop1-dev-tutorial
aip-workshop1-dev-tutorialaip-workshop1-dev-tutorial
aip-workshop1-dev-tutorial
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014
 
Arabidopsis Information Portal overview from Plant Biology Europe 2014
Arabidopsis Information Portal overview from Plant Biology Europe 2014Arabidopsis Information Portal overview from Plant Biology Europe 2014
Arabidopsis Information Portal overview from Plant Biology Europe 2014
 

Recently uploaded

ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Penicillin...........................pptx
Penicillin...........................pptxPenicillin...........................pptx
Penicillin...........................pptx
Cherry
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
muralinath2
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...
Sérgio Sacani
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 

Recently uploaded (20)

Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
INSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere UniversityINSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere University
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
Penicillin...........................pptx
Penicillin...........................pptxPenicillin...........................pptx
Penicillin...........................pptx
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptxGLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 

Packaging computational biology tools for broad distribution and ease-of-reuse

  • 1. PACKAGING COMPUTATIONAL BIOLOGY TOOLS FOR BROAD DISTRIBUTION AND EASE-OF- REUSE Matthew Vaughn @mattdotvaughn Director, Life Sciences Computing Texas Advanced Computing Center 1
  • 2. TACC AT A GLANCE 7/11/2016 2 Personnel 160 Full time staff (~70 PhD) Facilities 10 MW Data center capacity Redundant facilities for storage & hosting Systems and Services A Billion compute hours per year 5 Billion files, 50 Petabytes of Data, Hundreds of Public Datasets Capacity & Services HPC, HTC, Visualization, Large scale data storage, Cloud computing Consulting, Curation and analysis, Code optimization, Portals and Gateways, Web service APIs, Training and Outreach
  • 3. SCIENTIFIC COMPUTING: EARLY DAYS 7/11/2016 3 • C/C++/FORTRAN/PERL/SHELL • MPI • LAPACK/BLAS/PETSC • SGE • UNIX • X86/PPC/SPARC
  • 4. SCIENTIFIC COMPUTING: NOW 7/11/2016 4 LANGUAGES • Python 2 & 3 • R • Julia • Perl • Matlab • Java • Scala, Clojure, etc • .NET • C/C++ • Swift • Haskell • Go • Javascript FRAMEWORKS • MapReduce Hadoop, Storm, Pachyderm, Cloudera • Event & Streaming: Kinesis, Azure Stream Analytics, Camel, Streambase • Deep/Machine Learning: Watson, Azure BI, Tensorflow, Caffe • In-memory parsing: Kognito, Apache Spark • Containers: Docker, Rocket, MESOS, Kubernetes • Cloud: AWS, GCE, OpenStack, vCloud, Azure HARDWARE •Many-core computing - 50-100 threads/node* • Xeon / Xeon Phi • GPU • OpenPower • ARM • ShenWei • Multi-level memory architecture • Hierarchical storage • FPGAs • Quantum-like systems
  • 5. NOT JUST ONE DATA TSUNAMI BUT THOUSANDS OF THEM 7/11/2016 5
  • 6. EMERGENCE OF CLOUD TECHNOLOGY AND BUSINESS MODELS 7/11/2016 6
  • 7. Mike • Computing novice • Works remotely at partner site Eliza • Masters specific computing analysis skills • Readily adopts new technology Roshan • Computationally experienced • Focused on interpretation Paulo • Staff computational expert • Supports multiple projects DEMOCRATIZED COMPUTING 7/11/2016 7 Nikolaidas Group • Mostly experimentalists • Strict data sharing & access rules
  • 8. THEIR NEEDS (30,000 FT VIEW)  Store, organize, share measured data  Do (and re-do) perform 1’ analyses  Store, organize, share derived data results  Invent and explore hypotheses  Share analysis code with the scientific public  Integrate results from new experiments  Publish data and plots, visualizations and analysis tools 7/11/2016 8
  • 9. THEIR NEEDS (500 FT VIEW)  Data lifecycle management  Fine-grained permissions and roles  Discoverability  Version control  Domesticating promising new analysis codes  Wrestling with immature technology  Making their science internally reproducible  Adopting efficient analytical methods 7/11/2016 9
  • 10. HOW DO WE HELP RESEARCHERS WITH DIVERSE NEEDS AND BACKGROUNDS KEEP UP WITH ADVANCES IN COMPUTATIONAL BIOLOGY? 7/11/2016 10
  • 11. ESSENTIAL DRIVERS 7/11/2016 11 • I need to customize my research environment • I need to share my research environment with others • I need to teach others how to use my research environment
  • 12. CUSTOMIZING THE HPC ENVIRONMENT 7/11/2016 12 INSTALL PACKAGE X, VERSION 1.0.1 ON RESOURCE Y? TACC Life Sciences Strategy • Maintain public Github repo of build instructions for RPMs • Contributions accepted via pull request or direct from trusted partner • TACC staff build, test, install RPMs • RPMs are relocatable, stored for re-use in future • Software is discoverable via modules Similar concept to Rocks Rolls. See also… BioLinux. TACC LSC spec repository
  • 13. PROS AND CONS 7/11/2016 13 PROS • Direct path for community contribution • Digital footprint is small: Spec files + RPM artifacts • Build instructions are discoverable, version controlled, attributable • Resulting digital artifacts are reusable on the host system CONS • Learning curve is steep. Need to master RPM build system plus host-specific build systems • TACC staff still have to build/test/deploy • SPEC files have dialects. Impedes portability between systems. • RPMs only work on target system. VERDICT: NOT HELPFUL AT ALL FOR TRAINING. DOES IMPROVE THE WORK ENVIRONMENT.
  • 14. SHAREABLE VM IMAGES AND DATA 7/11/2016 14 OpenStack + Science-oriented GUI • Cyverse Atmosphere (UA/TACC) • XSEDE Jetstream (IU/TACC) • NSF Chameleon (UC/TACC) • Users extend and share VM images configured with specific software • Users can share data volumes • VMs can be set up multi-user • VMs can be exported and published • 95% self-service! Screenshot from Jetstream, early user period PROVIDE A COMPLETE COMPUTER THAT CAN RUN PACKAGE X, VERSION 1.0.1
  • 15. PROS AND CONS 7/11/2016 15 PROS • Entirely self-service VM image publishing • Versioned VM images plus extra metadata is good for reproducible analysis • Learning curve for USING the cloud system is shallow • Learning curve for building and sharing VMs is modest • Multiple operating systems are supported CONS • Portability between cloud providers is limited • Orchestrating 2+ VMs is very difficult • Digital footprint is the application plus OS, per version • Requires either purchase of commercial cloud capacity or sustained investment in private cloud (OS/Euc/Nebula) VERDICT: QUITE HELPFUL FOR TRAINING OR SOFTWARE SHARING, BUT…
  • 16. CONTAINER TECHNOLOGIES 7/11/2016 16 DockerHub is one public repository for images Cyverse and other TACC projects use Docker • Package software + dependencies + essential configuration • Based on a file (Dockerfile, for instance) that can be shared and maintained under version control • Synergistic with VM and web service approaches • Unites deployment strategy for a platform and its hosted software Exemplar Dockerfile for teaching the basics of containerization PROVIDE ONLY THE FILES & DEPENDENCIES FOR PACKAGE X, VERSION 1.0.1
  • 17. PROS AND CONS 7/11/2016 17 PROS • Minimal storage footprint – just the deltas beyond base operating system footprint • Instructions for building images are plaintext files in a DSL • Images can be made FAIR via standards and public repositories • Students eagerly adopt docker run x:1.0.1 command CONS • Usage requires reasonable understanding of key Linux technologies • Images themselves lack documentation and validation • Software must be based on Linux kernel 3.10+ • Not optimized for large or persistent data • Most public-funded infrastructure does not support containers VERDICT: HELPFUL FOR SHARING AND TEACHING, IF USED & RESOURCED PROPERLY
  • 18. WEB SERVICE APIS 7/11/2016 18 TACC/Cyverse Agave API • Package scientific applications as nodes in a workflow • Model 1' and 2' data + host resources as dependencies • Handle validation+invocation, data marshaling+management, resource brokering+coordination, identity+access for the user • Use either GUIs, language libraries, or direct service calls for access DNA Subway runs real-scale NGS tools on TACC & AWS PROVIDE ABSTRACT ABILITY TO RUN PACKAGE X, VERSION 1.0.1 ON RESOURCE Y
  • 19. PROS AND CONS 7/11/2016 19 PROS • Digital footprint is small: Application + dependencies only • Compute and storage protocols are abstracted from end users • Provenance is captured and maintained CONS • Requires non-trivial investment by tool developers • Direct usage (not via portal) is challenging for end users • Expectation management and user support for end users is challenging • Requires sustained investment in maintaining the services platform in addition to its resources VERDICT: HELPFUL TO ENABLE USE OF REAL-SCALE TOOLS & DATA
  • 20. JUPYTER NOTEBOOKS 7/11/2016 20 TACC provides Jupyter Notebook support on its HPC systems and hosted JupyterHub in its API platform • Narrative, rich-text interface with embedded media • In-line code execution for key languages such as Python, Julia, Go, Haskell, R, Torch • Interactive, exploratory computing • Notebooks themselves are self-contained sharable documents Part of a larger strategy to support interactive mode computing: Rstudio, Mathematica, SAGE, and Matlab Loading the Iris Data Set, used for teaching statistical classification techniques, into a Jupyter notebook DOCUMENT ALL THE STEPS AROUND RUNNING PACKAGE X, VERSION 1.0.1
  • 21. PROS AND CONS 7/11/2016 21 PROS • Digital footprint is small: Notebooks are text docs • Narrative support plus interactive widgets are first-class aspects • Closest analog yet to team programming (but async!) CONS • Notebook infrastructure itself needs resourcing • Notebooks don’t natively provide the software or data dependencies for a given workflow • They also don’t scale (natively) • Need to train end users in how to use Notebooks + the specific bioinformatics skills VERDICT: FANTASTIC PLATFORM TO AUTHOR INTERACTIVE TUTORIALS & GUIDES, BUT…
  • 22. NO SILVER BULLETS 7/11/2016 22  Installable RPMs aren’t that easy to build  VM s aren’t always portable or scalable  Jupyter Notebooks need pre-configured hosts  Containers need support for persistent data  The cloud is an abstraction for “someone else’s problem” THESE TOOLS COMPLEMENT ONE ANOTHER Bioinformatics training
  • 23. ENSEMBLES VS ONE-MAN BANDS 7/11/2016 23 AKES01: Clouds, Clusters, and Containers: Tools for responsible, collaborative computing • JupyterHub hosted on XSEDE Jetstream • Multiuser Docker host hosted on XSEDE Jetstream • Web shell for access even from tablets • Jupyter notebooks pre-configured to interact with Agave APIs • Storage on Cyverse Data Store cloud • Application registry @ Cyverse App catalog • Using DockerHub & Github commercial resources Skills focus • Containerization 101 + Web services 101 • Sharing and publishing data, code, and results • Provenance and other metadata John Fonner from TACC demonstrating file provenance within Cyverse

Editor's Notes

  1. Driven by BIG DATA and its relentless adoption by non-traditional users of Scientific Computing
  2. Everyone’s generating TERASCALE DATA Data spread is real
  3. NOBODY WANTS TO DO CAPEX
  4. EVERYBODY IS COMPUTING
  5. Make technologies available to them. Train folks how to use.
  6. The drivers for computational biology are the same as for laboratory science
  7. Less than half of Cyverse web service tools contributed by 1’ author. I’ll wager similar number at Galaxy toolshed, various Docker repos for science.
  8. Less than half of Cyverse web service tools contributed by 1’ author. I’ll wager similar number at Galaxy toolshed, various Docker repos for science.