Large-scale Genomic Analysis Enabled by Gordon
A brief overview of a case study where 438 human genomes underwent read mapping and variant calling in under two months. Architectural requirements for the multi-stage pipeline are covered.

Presentation Transcript

    • Large-scale Genomic Analysis Enabled by Gordon
      Kristopher Standish*^, Tristan M. Carland*, Glenn K. Lockwood+^, Mahidhar Tatineni+^, Wayne Pfeiffer+^, Nicholas J. Schork*^
      * Scripps Translational Science Institute
      + San Diego Supercomputer Center
      ^ University of California San Diego
      Project funding provided by Janssen R&D
    • Background
      •  Janssen R&D performed whole-genome sequencing on 438 patients undergoing treatment for rheumatoid arthritis
      •  Problem: correlate response or non-response to drug therapy with genetic variants
      •  Solution combines multi-disciplinary expertise:
         •  genomic analytics from Janssen R&D and the Scripps Translational Science Institute (STSI)
         •  data-intensive computing from the San Diego Supercomputer Center (SDSC)
    • Technical Challenges
      •  Data volume: raw reads from 438 full human genomes
         •  50 TB of compressed data from Janssen R&D
         •  delivered encrypted on 8x 6 TB SATA RAID enclosures
      •  Compute: perform read mapping and variant calling on all genomes
         •  9-step pipeline to achieve high-quality read mapping
         •  5-step pipeline to do group variant calling for analysis
      •  Project requirements:
         •  FAST turnaround (assembly in < 2 months)
         •  EFFICIENT (minimum core-hours used)
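For a sense of scale, this is simple arithmetic on the totals stated above (not figures from the slides): the turnaround requirement implies sustaining roughly seven genomes per day, each arriving as on the order of 100 GB of compressed reads.

    # Back-of-the-envelope scale of the project, derived from the stated
    # totals: 438 genomes, ~50 TB compressed input, < 2 months turnaround.
    genomes = 438
    compressed_tb = 50
    days_available = 60                      # "< 2 months"

    print(f"~{genomes / days_available:.1f} genomes/day sustained")       # ~7.3
    print(f"~{compressed_tb * 1000 / genomes:.0f} GB compressed/genome")  # ~114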
    • Read Mapping Pipeline: Looks Uniform from a Traditional HPC Perspective
      [Slide figure: walltime vs. thread-level parallelism for each step, dimensions drawn to scale]
      1.  Map (BWA)
      2.  SAM to BAM (SAMtools)
      3.  Merge Lanes (SAMtools)
      4.  Sort (SAMtools)
      5.  Mark Duplicates (Picard)
      6.  Target Creator (GATK)
      7.  Indel Realigner (GATK)
      8.  Base Quality Score Recalibration (GATK)
      9.  Print Reads (GATK)
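The slides do not include the job scripts themselves; the following is a minimal per-sample sketch of how these nine steps are typically chained, assuming GATK 3-style invocations and placeholder file names, lane layout, thread counts, and reference/known-sites paths (none of which come from the slides).

    # Hypothetical per-sample driver for the nine read-mapping steps.
    # Tool invocations follow standard BWA / SAMtools / Picard / GATK 3
    # usage; file names, thread counts, and resource paths are
    # placeholders rather than the settings used on Gordon.
    import subprocess

    def run(cmd):
        """Run one pipeline stage, aborting if it exits nonzero."""
        print(">>", cmd)
        subprocess.run(cmd, shell=True, check=True)

    REF    = "hg19.fa"                  # reference genome (assumed)
    KNOWN  = "dbsnp.vcf"                # known sites for recalibration (assumed)
    GATK   = "GenomeAnalysisTK.jar"     # GATK 3-era jar
    SAMPLE = "patient_001"
    LANES  = ["L1", "L2"]               # sequencing lanes for this sample (assumed)

    for lane in LANES:
        # 1. Map reads with BWA-MEM (thread-level parallelism)
        run(f"bwa mem -t 16 {REF} {SAMPLE}_{lane}_R1.fq.gz {SAMPLE}_{lane}_R2.fq.gz "
            f"> {SAMPLE}_{lane}.sam")
        # 2. Convert SAM to BAM
        run(f"samtools view -bS {SAMPLE}_{lane}.sam > {SAMPLE}_{lane}.bam")

    # 3. Merge lane-level BAMs into one per-sample BAM
    #    (coordinate order is restored by the sort in step 4)
    lane_bams = " ".join(f"{SAMPLE}_{lane}.bam" for lane in LANES)
    run(f"samtools merge {SAMPLE}.merged.bam {lane_bams}")

    # 4. Coordinate-sort: the I/O- and disk-capacity-bound step
    run(f"samtools sort -@ 16 -o {SAMPLE}.sorted.bam {SAMPLE}.merged.bam")

    # 5. Mark PCR/optical duplicates
    run(f"java -jar picard.jar MarkDuplicates I={SAMPLE}.sorted.bam "
        f"O={SAMPLE}.dedup.bam M={SAMPLE}.dup_metrics.txt")
    run(f"samtools index {SAMPLE}.dedup.bam")

    # 6-7. Local realignment around indels (GATK 3 workflow)
    run(f"java -jar {GATK} -T RealignerTargetCreator -R {REF} "
        f"-I {SAMPLE}.dedup.bam -o {SAMPLE}.intervals")
    run(f"java -jar {GATK} -T IndelRealigner -R {REF} -I {SAMPLE}.dedup.bam "
        f"-targetIntervals {SAMPLE}.intervals -o {SAMPLE}.realigned.bam")

    # 8-9. Base quality score recalibration, then write recalibrated reads
    run(f"java -jar {GATK} -T BaseRecalibrator -R {REF} -I {SAMPLE}.realigned.bam "
        f"-knownSites {KNOWN} -o {SAMPLE}.recal.table")
    run(f"java -jar {GATK} -T PrintReads -R {REF} -I {SAMPLE}.realigned.bam "
        f"-BQSR {SAMPLE}.recal.table -o {SAMPLE}.final.bam")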
    • Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & I/O)
      [Slide figure: the same nine steps (BWA, SAMtools, Picard, GATK) plotted as walltime vs. memory requirement, dimensions drawn to scale]
    • Sort Step: Bound by Disk I/O and Capacity
      Problem: 16 threads require...
         •  25 GB DRAM
         •  3.5 TB local disk
         •  1.6 TB input data
      ...which generate...
         •  3,500 IOPS (metadata-rich)
         •  1 GB/s read rate
      Solution: BigFlash nodes
         •  64 GB DRAM/node
         •  16x 300 GB SSDs (4.4 TB usable local flash)
         •  1.6 GB/s from Lustre to SSDs over a dedicated I/O InfiniBand rail
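The slides do not show the exact sort invocation; below is a minimal sketch of how the sort step might be pointed at node-local flash, assuming samtools 1.x options and a hypothetical SSD scratch path.

    # Hypothetical sort invocation on a BigFlash node: temporary sort
    # chunks spill to node-local SSD scratch rather than Lustre. The
    # scratch path, thread count, and per-thread memory are assumptions.
    import subprocess

    SSD_SCRATCH = "/scratch/ssd"           # assumed node-local flash mount
    SAMPLE = "patient_001"

    cmd = (
        f"samtools sort "
        f"-@ 16 "                          # 16 sort threads
        f"-m 1500M "                       # ~1.5 GB/thread, ~24 GB total (within 64 GB/node)
        f"-T {SSD_SCRATCH}/{SAMPLE}.tmp "  # write temporary chunks to SSD
        f"-o {SAMPLE}.sorted.bam {SAMPLE}.merged.bam"
    )
    subprocess.run(cmd, shell=True, check=True)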
    • Group Variant Calling Pipeline
      [Slide figure: walltime vs. thread-level parallelism for the five steps, drawn approximately to scale]
      •  Massive data reduction at the first step
      •  Reduction in data parallelism
      •  Subsequent steps (#2-#5) offloaded to a campus cluster
         •  1-6 threads each
         •  10-30 min each
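The exact commands are not in the slides; the sketch below shows one plausible way the compute-heavy first step (GATK 3 HaplotypeCaller over all 438 BAMs) could be split by chromosome for scheduling. The joint-calling strategy, file names, memory, and thread counts are assumptions.

    # Hypothetical launch of the first variant-calling step: GATK 3
    # HaplotypeCaller run over all sample BAMs, split by chromosome so
    # each chunk can be scheduled as an independent job.
    import subprocess

    REF = "hg19.fa"
    GATK = "GenomeAnalysisTK.jar"
    bams = [f"patient_{i:03d}.final.bam" for i in range(1, 439)]   # 438 samples
    inputs = " ".join(f"-I {bam}" for bam in bams)

    for chrom in [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y"]]:
        cmd = (
            f"java -Xmx16g -jar {GATK} -T HaplotypeCaller "
            f"-R {REF} {inputs} "
            f"-L {chrom} "             # restrict calling to one chromosome
            f"-nct 8 "                 # CPU threads per job
            f"-o variants.{chrom}.vcf"
        )
        subprocess.run(cmd, shell=True, check=True)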
    • Footprint on Gordon: CPUs and Storage Used
      •  257 TB of Lustre scratch used at peak
      •  5,000 cores (30% of Gordon) in use at once
    • Time to Completion
      •  Overall:
         •  36 core-years of compute used in 6 weeks (equivalent to 310 cores running 24/7)
         •  57 TB of DRAM used (aggregate)
      •  Read mapping (9-step pipeline):
         •  5 weeks including time for learning on Gordon (16 days of compute in the public batch queue)
         •  over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically)
      •  Variant calling (GATK HaplotypeCaller):
         •  5 days and 3 hours on Gordon
         •  10.5 months of 24/7 compute on a 16-core workstation
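As a quick arithmetic check on the headline figure (not from the slides): 36 core-years delivered over 6 elapsed weeks works out to roughly 310 cores running continuously.

    # Arithmetic check: 36 core-years of compute in 6 elapsed weeks
    # implies roughly how many cores running 24/7?
    core_years = 36
    weeks_elapsed = 6
    weeks_per_year = 365.25 / 7                           # ~52.18

    equivalent_cores = core_years * weeks_per_year / weeks_elapsed
    print(f"~{equivalent_cores:.0f} cores running 24/7")  # ~313; the slide rounds to 310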
    • Acknowledgements
      Janssen Research & Development: Chris Huang, Ed Jaeger, Sarah Lamberth, Lance Smith, Zhenya Cherkas, Martin Dellwo, Carrie Brodmerkel, Sandor Szalma, Mark Curran, Guna Rajagopal