Optimizing High Performance Big Data
Cancer Workflows
Iván L. Jiménez Ruiz
University of Puerto Rico, Río Piedras Campus
Ricardo González Méndez
University of Puerto Rico, School of Medicine
Alexander Ropelewski
Pittsburgh Supercomputing Center
Outline
Ø  Aims
Ø  Bridges architecture
Ø  Workflow and software
Ø  Timings and performance
Ø  Recommendations for similar workflows
Aims
Ø Implement workflows on Bridges supercomputer.
Ø Measure performance of file systems using NGS data.
Ø Determine where to run NGS programs to improve overall workflow efficiency.
Ø Generate recommendations based on benchmarked data.
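The second and third aims amount to timing the same program against different storage locations. A minimal sketch of such a harness follows, assuming each candidate file system is reachable as a directory. The function names `time_command` and `benchmark` are illustrative and do not come from the slides.

```python
import subprocess
import time


def time_command(cmd, workdir):
    """Run cmd with workdir as its working directory.

    Returns (elapsed wall-clock minutes, exit code).
    """
    start = time.perf_counter()
    result = subprocess.run(cmd, cwd=workdir, check=False)
    elapsed_min = (time.perf_counter() - start) / 60.0
    return elapsed_min, result.returncode


def benchmark(cmd, workdirs):
    """Time the same command once per candidate directory.

    workdirs maps a label (for example "RM-Local") to a directory path.
    Failed runs are skipped so that an early crash is not recorded as a fast run.
    """
    timings = {}
    for label, workdir in workdirs.items():
        minutes, code = time_command(cmd, workdir)
        if code == 0:
            timings[label] = minutes
    return timings
```

Wall-clock time is used here because I/O waits are the quantity of interest. CPU time alone would hide file-system effects.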
Bridges – Phase 1
Compute nodes
Ø  8 LM Nodes: 3 TB RAM per node
Ø  2 ESM Nodes: 12 TB RAM per node
Ø  752 RSM Nodes: 128 GB RAM per node
Ø  Each node type has $RAMDISK and $LOCAL storage (capacities labeled in the diagram: 3.7 TB, 15 TB, 64 TB)
Shared file systems
Ø  Pylon1: 1 PB storage, Lustre
Ø  Pylon2: 1 PB storage, Slash2
Phase 1 - File systems
$LOCAL
Ø  Disk on the node
Ø  Volatile
Ø  Capacity B
/pylon1
Ø  $SCRATCH storage
Ø  Lustre
Ø  High-speed
Ø  Capacity C
$RAMDISK
Ø  Memory storage
Ø  Volatile
Ø  Capacity A
/pylon2
Ø  Archival storage
Ø  Distributed, high-storage capabilities
Ø  Wide-area file system
Ø  Capacity D
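Because $LOCAL and $RAMDISK are volatile, a job that uses them has to copy inputs in and copy results out before it ends. The sketch below shows one way to do that, assuming $LOCAL and $RAMDISK are exposed as environment variables inside a job (as the `$` notation suggests). The helper names and the fallback to the system temp directory are illustrative assumptions, not part of the original workflow.

```python
import os
import shutil
import tempfile
from pathlib import Path


def pick_scratch(preferred=("LOCAL", "RAMDISK")):
    """Return the first existing directory named by the given environment
    variables, falling back to the system temp directory (an assumption
    made so the sketch also runs outside a batch job)."""
    for var in preferred:
        path = os.environ.get(var)
        if path and os.path.isdir(path):
            return Path(path)
    return Path(tempfile.gettempdir())


def stage_in(files, scratch):
    """Copy input files into a fresh working directory under scratch."""
    workdir = Path(tempfile.mkdtemp(prefix="ngs_", dir=scratch))
    for f in files:
        shutil.copy2(f, workdir / Path(f).name)
    return workdir


def stage_out(workdir, dest, pattern="*"):
    """Copy result files to persistent storage, then remove the volatile
    working directory."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for f in Path(workdir).glob(pattern):
        if f.is_file():
            shutil.copy2(f, dest / f.name)
    shutil.rmtree(workdir)
```

In a job script this would typically be `workdir = stage_in(inputs, pick_scratch())`, then the analysis step, then `stage_out(workdir, results_dir)` with `results_dir` on a persistent file system such as /pylon1 or /pylon2.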
Workflow (diagram source: http://cole-trapnell-lab.github.io/cufflinks/manual/)
Ø  Input: reads from Condition A and Condition B
Ø  Quality control
Ø  Align to genome: TopHat (Bowtie, Bowtie2), HISAT2 → mapped reads
Ø  Assemble transcripts: Cufflinks → assembled transcripts
Ø  Merge transcript assemblies: Cuffmerge → final transcriptome assembly
Ø  Find differentially expressed genes & transcripts: Cuffdiff → differential expression results
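The stages in the workflow can be sketched as a sequence of command lines. This is only an illustration, assuming single-end reads, a prebuilt HISAT2 index, FastQC for the quality-control step, and samtools for sorting. The file names, thread count, and helper names are hypothetical, and the slides do not give the options the authors actually used.

```python
import shlex


def build_sample_commands(reads, index, outdir, threads=16):
    """Return per-sample command lines (QC, align, sort, assemble) in order.

    Paths are placeholders; outdir is assumed to exist already.
    """
    sam = f"{outdir}/aligned.sam"
    bam = f"{outdir}/aligned.sorted.bam"
    t = str(threads)
    return [
        ["fastqc", "-o", outdir, reads],
        # --dta-cufflinks asks HISAT2 for alignments tailored to Cufflinks
        ["hisat2", "-p", t, "--dta-cufflinks", "-x", index, "-U", reads, "-S", sam],
        ["samtools", "sort", "-@", t, "-o", bam, sam],
        ["cufflinks", "-p", t, "-o", f"{outdir}/cufflinks", bam],
    ]


def build_comparison_commands(assembly_list, bam_a, bam_b, outdir, threads=16):
    """Return the merge and differential-expression commands for two conditions.

    assembly_list is a text file listing one transcripts.gtf path per line.
    """
    t = str(threads)
    merged_dir = f"{outdir}/merged_asm"
    return [
        ["cuffmerge", "-p", t, "-o", merged_dir, assembly_list],
        ["cuffdiff", "-p", t, "-o", f"{outdir}/cuffdiff",
         f"{merged_dir}/merged.gtf", bam_a, bam_b],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in build_sample_commands("primary.fastq", "genome_index", "out"):
        print(shlex.join(cmd))
```

Building the commands as argument lists keeps them easy to time individually and to redirect to a different working directory per file system.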
Software used in workflows
Bowtie 1&2
Ø  Low memory
Ø  Dozens of cores
Ø  Demanding I/O
HISAT2
Ø  Low memory
Ø  Dozens of cores
Ø  Demanding I/O
TopHat
Ø  Low memory
Ø  Dozens of cores
Ø  Demanding I/O
FastQC
Ø  Low memory
Ø  Single core
Ø  Demanding read I/O
Cufflinks
Ø  Moderate to large memory
Ø  Dozens of cores
Ø  Demanding I/O
Benchmark Data Sets
Two public transcriptomic datasets of glioblastomas (malignant
brain cancer) in human subjects were used:
Ø  Primary Tumor (SRR3477485)
- Size: 2.2GB, 3,792 MBases
Ø  Recurrent Tumor (SRR3477486)
- Size: 3.6GB, 6,474 MBases
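Since the two datasets differ in size, raw run times are easier to compare after normalizing by the amount of sequence processed. A small sketch follows, using only the sizes listed above. The helper names are illustrative, and any run time passed in would have to come from an actual measurement.

```python
# Dataset sizes as listed on the slide (megabases and gigabytes).
DATASETS = {
    "SRR3477485": {"label": "Primary Tumor", "mbases": 3792, "size_gb": 2.2},
    "SRR3477486": {"label": "Recurrent Tumor", "mbases": 6474, "size_gb": 3.6},
}


def mbases_per_minute(accession, minutes):
    """Throughput of one run: megabases processed per wall-clock minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return DATASETS[accession]["mbases"] / minutes


def size_ratio(numerator, denominator, key="mbases"):
    """Ratio of two datasets by megabases (default) or by file size."""
    return DATASETS[numerator][key] / DATASETS[denominator][key]


# The recurrent-tumor dataset is about 1.71x the primary one in bases
# and about 1.64x in file size.
print(round(size_ratio("SRR3477486", "SRR3477485"), 2))
print(round(size_ratio("SRR3477486", "SRR3477485", key="size_gb"), 2))
```

If run time scaled linearly with input, the recurrent-tumor runs would be expected to take roughly 1.7 times as long. Larger gaps on a given file system would point to I/O rather than compute.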
Timings: Mapping – HISAT2
[Bar chart: time (minutes), axis 0 to 35, by partition – file system (RM-Pylon1, LM-Ram, RM-Local, RM-Pylon2, LM-Local, LM-Pylon2). Series: Primary Tumor, Recurrent Tumor, Recurrent Tumor (2nd Run).]
Timings: Mapping – TopHat (Bowtie1)
[Bar chart: time (minutes), axis 0 to 1000, by partition – file system (RM-Pylon1, LM-Ram, RM-Local, RM-Pylon2, LM-Local, LM-Pylon2). Series: Primary Tumor, Recurrent Tumor, Recurrent Tumor (2nd Run).]
Timings: Assembling – Cufflinks (TopHat, Bowtie1)
[Bar chart: time (minutes), axis 0 to 120, by partition – file system (LM-Ram, RM-Local, RM-Pylon2, LM-Pylon2). Series: Primary Tumor, Recurrent Tumor.]
Summary: Quality Control, Aligning and Mapping
[Bar chart: time (minutes), axis 0 to 600, by program (Tophat_Bowtie1, Tophat_Bowtie2, HiSat2, FastQC). Series, partition – file system: RM-Pylon1, RM-Local, RM-Pylon2, LM-Local, LM-Pylon2.]
Bridges changes since internship
Current:
Ø  /pylon5
Ø  10 PB Pylon storage
Ø  LM Nodes: 8 with 4x16 cores + 34 with 4x20 cores
Ø  ESM Nodes: 2 with 16x18 cores + 2 with 16x22 cores
Original:
Ø  /pylon1
Ø  2 PB Pylon storage
Ø  LM Nodes: 8 with 4x16 cores
Ø  ESM Nodes: 2 with 16x18 cores
Conclusions
Ø Bioinformatics workflows need to be reengineered regularly to
perform optimally on HPC systems.
Ø $LOCAL and $RAMDISK performed comparably
- $RAMDISK usage carried associated service charges
- Recommendation: prefer $LOCAL over $RAMDISK
Ø /pylon1 performed similarly to both $LOCAL and $RAMDISK
- Suitable for staging results and intermediate storage of output files
Conclusions
Ø /pylon2 had:
- most variability
- worst performance
For similar workflows, we recommend using /pylon2 for long-term storage
and archiving needs.
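Statements such as "most variability" and "worst performance" can be backed by summarizing repeated timings per partition and file system. The sketch below computes the mean and the coefficient of variation for each label. The function names are illustrative, and any numbers fed to it must come from real measurements (none are reproduced here, since the chart values are not available in the text).

```python
import statistics


def summarize(timings):
    """Map each label to (mean minutes, coefficient of variation).

    timings: {label: [minutes, minutes, ...]} with at least one run per label.
    A label with a single run gets a coefficient of variation of 0.0.
    """
    summary = {}
    for label, runs in timings.items():
        mean = statistics.mean(runs)
        spread = statistics.stdev(runs) if len(runs) > 1 else 0.0
        summary[label] = (mean, spread / mean if mean else 0.0)
    return summary


def rank_by_mean(summary):
    """Return labels ordered from fastest to slowest mean run time."""
    return sorted(summary, key=lambda label: summary[label][0])


def most_variable(summary):
    """Return the label with the largest coefficient of variation."""
    return max(summary, key=lambda label: summary[label][1])
```

The coefficient of variation is used instead of the raw standard deviation so that slow and fast configurations can be compared on the same scale.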
Acknowledgements
University of Puerto Rico, Río Piedras Campus
Dr. Humberto Ortiz-Zuazaga
University of Pittsburgh
Department of Biomedical Informatics
Dr. David Boone
Dr. Uma Chandran
Funding:
•  The NIH Big Data to Knowledge (BD2K) Enhancing Diversity in Biomedical Data Science Grant [9]
5R25MD010399-002 to the UPRRP
•  The National Institutes of Health Minority Access to Research Careers (MARC) grant T36-GM-095335 and
National Institutes of Health Biomedical Technology Resource grant P41-GM-103712 to the Pittsburgh
Supercomputing Center (PSC)
•  The computing resources used were provided through the Extreme Science and Engineering Discovery
Environment (XSEDE), which is supported by the National Science Foundation grant OCI-1053575.
•  The Bridges supercomputer system at the PSC was acquired through NSF Award ACI-1445606.
Questions?
goo.gl/ifrr98
