Large-scale Genomic Analysis Enabled by Gordon

A brief overview of a case study where 438 human genomes underwent read mapping and variant calling in under two months. Architectural requirements for the multi-stage pipeline are covered.

  1. Large-scale Genomic Analysis Enabled by Gordon
     Kristopher Standish*^, Tristan M. Carland*, Glenn K. Lockwood+^, Mahidhar Tatineni+^, Wayne Pfeiffer+^, Nicholas J. Schork*^
     * Scripps Translational Science Institute
     + San Diego Supercomputer Center
     ^ University of California San Diego
     Project funding provided by Janssen R&D
  2. Background
     •  Janssen R&D performed whole-genome sequencing on 438 patients undergoing treatment for rheumatoid arthritis
     •  Problem: correlate response or non-response to drug therapy with genetic variants
     •  Solution combines multi-disciplinary expertise:
        •  genomic analytics from Janssen R&D and Scripps Translational Science Institute (STSI)
        •  data-intensive computing from San Diego Supercomputer Center (SDSC)
  3. Technical Challenges
     •  Data Volume: raw reads from 438 full human genomes (see the per-genome estimate below)
        •  50 TB of compressed data from Janssen R&D
        •  encrypted on 8x 6 TB SATA RAID enclosures
     •  Compute: perform read mapping and variant calling on all genomes
        •  9-step pipeline to achieve high-quality read mapping
        •  5-step pipeline to do group variant calling for analysis
     •  Project requirements:
        •  FAST turnaround (assembly in < 2 months)
        •  EFFICIENT (minimum core-hours used)
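     A quick back-of-the-envelope view of what that data volume means per genome; this is my arithmetic, not a figure from the slides.

        # Rough per-genome share of the 50 TB of compressed reads (illustrative check).
        total_tb = 50
        genomes = 438
        per_genome_gb = total_tb * 1000 / genomes
        print(round(per_genome_gb), "GB of compressed reads per genome")  # ~114 GB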
  4. Read Mapping Pipeline: Looks Uniform from Traditional HPC Perspective...
     [Figure: the nine steps below plotted as walltime vs. thread-level parallelism; dimensions drawn to scale]
     1.  Map (BWA)
     2.  sam to bam (SAMtools)
     3.  Merge Lanes (SAMtools)
     4.  Sort (SAMtools)
     5.  Mark Duplicates (Picard)
     6.  Target Creator (GATK)
     7.  Indel Realigner (GATK)
     8.  Base Quality Score Recalibration (GATK)
     9.  Print Reads (GATK)
     (An illustrative command-level sketch of these steps follows below.)
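     To make the nine steps concrete, here is a minimal per-sample sketch in Python that shells out to the tools named above. File names (ref.fa, sample_L1_*.fq.gz, dbsnp.vcf), thread counts, and the exact flags are illustrative assumptions in the style of BWA, SAMtools, Picard, and GATK 3; they are not the authors' production scripts.

        import subprocess

        def run(cmd):
            """Run one pipeline stage, echo it, and stop on any failure."""
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        # Hypothetical inputs for one sample: a reference FASTA and one lane of
        # paired-end reads. A real run loops over every lane of all 438 genomes.
        ref, r1, r2 = "ref.fa", "sample_L1_R1.fq.gz", "sample_L1_R2.fq.gz"

        # 1. Map reads with BWA-MEM, capturing SAM on stdout.
        with open("sample_L1.sam", "w") as sam:
            subprocess.run(["bwa", "mem", "-t", "16", ref, r1, r2],
                           stdout=sam, check=True)

        # 2. Convert SAM to BAM.
        run(["samtools", "view", "-b", "-o", "sample_L1.bam", "sample_L1.sam"])

        # 3. Merge per-lane BAMs into one BAM per sample (a single lane shown).
        run(["samtools", "merge", "-f", "sample.merged.bam", "sample_L1.bam"])

        # 4. Coordinate-sort (the IO-heavy step examined on slide 6).
        run(["samtools", "sort", "-@", "16", "-o", "sample.sorted.bam",
             "sample.merged.bam"])

        # 5. Mark duplicates with Picard, then index the result.
        run(["java", "-jar", "picard.jar", "MarkDuplicates",
             "I=sample.sorted.bam", "O=sample.dedup.bam", "M=dup_metrics.txt"])
        run(["samtools", "index", "sample.dedup.bam"])

        # 6-7. Local realignment around indels (GATK 3-era walkers).
        run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "RealignerTargetCreator",
             "-R", ref, "-I", "sample.dedup.bam", "-o", "targets.intervals"])
        run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "IndelRealigner",
             "-R", ref, "-I", "sample.dedup.bam",
             "-targetIntervals", "targets.intervals", "-o", "sample.realn.bam"])

        # 8-9. Base quality score recalibration, then emit recalibrated reads.
        run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
             "-R", ref, "-I", "sample.realn.bam",
             "-knownSites", "dbsnp.vcf", "-o", "recal.table"])
        run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
             "-R", ref, "-I", "sample.realn.bam",
             "-BQSR", "recal.table", "-o", "sample.final.bam"])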
  5. Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)
     [Figure: the same nine steps (1-9 as on the previous slide) plotted as walltime vs. memory requirement; dimensions drawn to scale]
  6. Sort Step: Bound by Disk IO and Capacity
     Problem: 16 threads require...
     •  25 GB DRAM
     •  3.5 TB local disk
     •  1.6 TB input data
     ...which generate...
     •  3,500 IOPs (metadata-rich)
     •  1 GB/s read rate
     Solution: BigFlash (see the staging sketch after this slide)
     •  64 GB DRAM/node
     •  16x 300 GB SSDs (4.4 TB usable local flash)
     •  1.6 GB/s from Lustre to SSDs over a dedicated I/O InfiniBand rail
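     A minimal sketch, assuming samtools 1.x and hypothetical paths, of how the sort can be confined to a BigFlash node's local SSDs: stage the merged BAM from Lustre onto the flash scratch space, sort there with bounded per-thread memory and flash-resident temporary files, and copy only the sorted result back.

        import os
        import subprocess

        # Hypothetical paths: Lustre scratch holds the merged BAM; the BigFlash
        # node exposes its SSD array as a node-local scratch directory.
        lustre_in   = "/lustre/scratch/sample.merged.bam"            # assumed path
        ssd_scratch = os.environ.get("SSD_SCRATCH", "/scratch/ssd")  # assumed mount
        local_bam   = os.path.join(ssd_scratch, "sample.merged.bam")

        # Stage the input from Lustre to local flash so the IOPS-heavy sort
        # never touches the shared filesystem.
        subprocess.run(["cp", lustre_in, local_bam], check=True)

        # Sort on the SSDs: 16 threads, bounded per-thread memory,
        # temporary files kept on flash.
        sorted_bam = os.path.join(ssd_scratch, "sample.sorted.bam")
        subprocess.run([
            "samtools", "sort",
            "-@", "16",                                   # threads
            "-m", "1500M",                                # per-thread memory (assumed)
            "-T", os.path.join(ssd_scratch, "sorttmp"),   # temp-file prefix on flash
            "-o", sorted_bam,
            local_bam,
        ], check=True)

        # Copy only the sorted result back to Lustre.
        subprocess.run(["cp", sorted_bam,
                        "/lustre/scratch/sample.sorted.bam"], check=True)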
  7. Group Variant Calling Pipeline
     [Figure: five steps plotted as walltime vs. thread-level parallelism; dimensions approximately drawn to scale]
     •  Massive data reduction at first step (see the sketch after this slide)
     •  Reduction in data parallelism
     •  Subsequent steps (#2 - #5) offloaded to campus cluster
        •  1-6 threads each
        •  10-30 min each
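     Slide 9 names GATK HaplotypeCaller as the variant-calling engine. The sketch below shows one plausible way to shard a joint call over the cohort by chromosome using a GATK 3-style invocation; the BAM naming, interval choice, and memory setting are illustrative assumptions, and the slide does not spell out the actual five-step group-calling workflow.

        import subprocess

        # Hypothetical joint call on one chromosome over all 438 recalibrated BAMs.
        ref = "ref.fa"
        bams = ["sample_%03d.final.bam" % i for i in range(1, 439)]

        cmd = ["java", "-Xmx24g", "-jar", "GenomeAnalysisTK.jar",
               "-T", "HaplotypeCaller",
               "-R", ref,
               "-L", "chr1",                 # shard the genome by interval
               "-o", "cohort.chr1.vcf"]
        for bam in bams:
            cmd += ["-I", bam]               # one -I per input BAM

        subprocess.run(cmd, check=True)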
  8. Footprint on Gordon: CPUs and Storage Used
     •  257 TB of Lustre scratch used at peak
     •  5,000 cores (30% of Gordon) in use at once
  9. Time to Completion
     •  Overall (arithmetic check below):
        •  36 core-years of compute used in 6 weeks, equivalent to 310 cores running 24/7
        •  57 TB DRAM used (aggregate)
     •  Read Mapping (9-step pipeline):
        •  5 weeks, including time for learning on Gordon (16 days of compute in the public batch queue)
        •  over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically)
     •  Variant Calling (GATK HaplotypeCaller):
        •  5 days and 3 hours on Gordon
        •  10.5 months of 24/7 compute on a 16-core workstation
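     As a sanity check on the headline throughput figure; this is my arithmetic, and it is consistent with the number quoted on the slide.

        # 36 core-years of compute delivered in 6 weeks of wall time.
        core_years = 36
        weeks = 6
        sustained_cores = core_years * 52 / weeks
        print(round(sustained_cores), "cores running 24/7")  # ~312, in line with the ~310 quoted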
  10. Acknowledgements (Janssen Research & Development)
      •  Chris Huang
      •  Ed Jaeger
      •  Sarah Lamberth
      •  Lance Smith
      •  Zhenya Cherkas
      •  Martin Dellwo
      •  Carrie Brodmerkel
      •  Sandor Szalma
      •  Mark Curran
      •  Guna Rajagopal
