Cool Genes: The Search for a Cure Using Genomics, Big Data, and Docker
A non-profit 501(c)(3) research institute focused on
• Discovering genetic changes & mechanisms that underlie a variety of human diseases
• Translating discoveries into new diagnostic tests and therapeutics.
Translational Genomics
Research Institute (TGen)
Basic Research: Understanding of disease
Translational Research – TGen: Development of diagnostics and therapeutics
Clinical Application: Application to patient
By the Way
Timeline of Precision
Medicine Innovation
exomes & genomes
2,965
transcriptomes to
date in prospective
studies
1,307
2006 2008 2010 2012 2014 2016 2018 2020
CLINICAL TRIALS
Bipolar disorder
General Oncology
Triple Negative Breast Cancer
Various cancers
Pancreatic Cancer
Melanoma
Multiple Myeloma
Pediatric Cancer - NMTRC
Small Cell Lung Cancer
Glioblastoma
Breast Cancer
Rare Childhood Disorders
Concussion
Childhood Cancer
Why We Do This…
http://goo.gl/zinVVS
Computational time cut
from weeks to hours
Optimized, purpose-built
hardware for analyzing NGS
Get treatments to
patients faster!
IoT Ecosystem
Analytics
Data Storage
Remote Devices
Internet Network Gateway
IoT Devices
Analysis
Command/RFI
Downstream
Analytics
Sboner et al. "The real cost of sequencing: higher than you think!" Genome Biology 2011, 12:125,
and thanks to Intel for the drawing
Sample collection &
experimental design
Sequencing
Data Reduction
Data Management
Downstream Analysis
Future
(Approximately 2020)
Pre-NGS
(Approximately 2000)
Now
(Approximately 2010)
0%
100%
Sequencing
Sample
Collection
Experimental
Design
Data Management
Data Reduction
Raw reads
(FASTA, FASTQ)
High-level summaries
(VCF, Peaks, RPKM)
Mapped reads
(BAM, CRAM, MRF)
Downstream Analysis
(differential expression, novel
TARs, regulatory networks,…)
Existing HPC Storage
Architecture
Sequencer Compute Archive
RAW
BCL
IRF
Lustre
Scratch
Lustre
Isilon
FASTQ
VCF
BAM Workstation S3
TGen HPC Workflows
Orchestration Application Compute
Storage
Torque
Maui
Moab
Slurm
LSF
Bright
Computing
Lustre CIFS/SMB NFS DASD Isilon NAS HSM/Tape
Pipeline
Ad-hoc
Reference data, STAR_GTF, GTF Mask,
DNA aligner, RNA aligner, Fusion Detector,
Indel realignment, Recalibrate, Copy
number, Allele count, Circos, Mutect,
Strelka, Snpsniff, Cuffquant, Sailfish,
IGLBEDCOV, Digar, BWA, Tophat
Java, R, STAR, bcl2fastq, blast, boost,
bowtie2, bwa, perl, python, picard-tools,
sailfish, salmon, samtools, cufflinks, bwa,
and many many others.
TGen HPC Hybrid Workflows
Hybrid Orchestration Application
Pipeline Ad-hoc
Reference data, STAR_GTF, GTF Mask, DNA aligner, RNA
aligner, Fusion Detector, Indel realignment, Recalibrate,
Copy number, Allele count, Circos, Mutect, Strelka, Snpsniff,
Cuffquant, Sailfish, IGLBEDCOV, Digar, BWA, Tophat
Java, R, STAR, bcl2fastq, blast, boost, bowtie2, bwa, perl,
python, picard-tools, sailfish, salmon, samtools, cufflinks,
bwa, and many many others.
Hybrid Compute
Storage
Docker
Single and multi layer
containers
Kubernetes
Deployment with
services & pods
Flexibility on-prem
and cloud
SVCS
Basic Pipeline
Persistence
Compute
Persistent Volumes in
Sequencing Orchestrator
BCL
Containers
FASTQ
Containers
IRF Volumes
(Prepared Input)
VCF Volumes
(Working Set)
BAM Volumes
(Shared Format)
Archive
(File, Object)
Downstream
Containers
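The persistence layout above can be read as a chain of container stages that hand data to each other through named volumes. The following is a minimal Python sketch of that idea, not TGen's implementation: the stage names, the volume names, and the choice of which stage reads or writes which volume are assumptions made for illustration, since the slide only lists the labels.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Stage:
    name: str
    reads: Optional[str]  # volume mounted as input (None for the first stage)
    writes: str           # volume the stage leaves its output on


# Hypothetical wiring of the slide's labels; the real mapping may differ.
STAGES = [
    Stage("bcl-containers", None, "irf-volumes"),
    Stage("fastq-containers", "irf-volumes", "bam-volumes"),
    Stage("downstream-containers", "bam-volumes", "vcf-volumes"),
    Stage("archive", "vcf-volumes", "archive-store"),
]


def run_order(stages: List[Stage]) -> List[str]:
    """Return stage names in order, failing if an input volume does not exist yet."""
    available: Set[str] = set()
    order: List[str] = []
    for stage in stages:
        if stage.reads is not None and stage.reads not in available:
            raise ValueError(
                f"{stage.name} needs volume {stage.reads!r} before it exists"
            )
        available.add(stage.writes)
        order.append(stage.name)
    return order
```

Because each stage only names the volume it reads and the volume it writes, the containers themselves can be replaced or rescheduled while the data stays on the persistent volumes.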
Containermania
Collaborative
Data
Warehouse
PHI Data
Laboratory
Quality
Control
Data
Processing
Reporting
LabUI
Interface
Access Portal
Agile CLIA
Lab
Rapid Test
Development
Healthcare Partner
Clinical
Reports
Specimens
Consent
Data Integration
Pharma Research
Infrastructure at
Regulatory Compliant
Data Center
Data Analytics
Minimal Infrastructure
Required
Docker Workflow RNA
Expression
Counts
Merge
fastqs
Mark
Duplicates
RNA Align
HTSeq
Merge
fastqs
Fusion
Detection
Differential
Expression
Stats/
Metrics
Expression
Counts
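The workflow steps above form a dependency graph, where each containerized step can start once the steps it depends on have finished. Here is a small Python sketch of deriving a valid run order with the standard library's `graphlib` (Python 3.9+). The step identifiers and the dependency edges are assumptions inferred from the step names; the slide does not state the edges.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish first (assumed edges).
WORKFLOW = {
    "merge_fastqs": set(),
    "rna_align": {"merge_fastqs"},
    "fusion_detection": {"merge_fastqs"},
    "mark_duplicates": {"rna_align"},
    "htseq": {"mark_duplicates"},
    "expression_counts": {"htseq"},
    "differential_expression": {"expression_counts"},
    "stats_metrics": {"mark_duplicates"},
}


def execution_order(workflow):
    """Return one ordering in which every step runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

Steps with no path between them, such as fusion detection and duplicate marking in this sketch, could run in parallel containers; the ordering above is only one of several valid schedules.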
Future Visions….
To the cloud we go…..
Long term storage
Sustainability
Research vs clinical application
Sequencing platform(s)
Downstream is where the “good stuff” is
Disparate data types
Politics
Legal
Patient owned information
Thank you!
Acknowledgements
Portworx
DellEMC
Dr. David Craig
Dr. Matt Huentelman
Dr. Jonathan Keats
Nelson Kick
TGen HPC Team

Cool Genes: The Search for a Cure Using Genomics, Big Data, and Docker - James Lowey - CIO, TGen
