Pitt's Barriers to Centralizing Clinical NGS Data Analysis

Computing Infrastructures for Clinical NGS
Barriers to centralizing data analysis at the University of Pittsburgh

M. Michael Barmada
Department of Human Genetics
Graduate School of Public Health, University of Pittsburgh

University of Pittsburgh

• Geographically dispersed campus (132 acres in Pittsburgh, plus regional campuses)
• Large affiliated hospital system (UPMC - 23 hospitals spread out over the tri-state area)
• Has been ranked in the top cluster of research institutions in the US
• 7th in the nation in NIH funding

NGS at Pitt

Common Analysis hurdles for NGS

• Hardware
• Computing capacity
• Network
• Storage
• Software

Common Problems - (1) Hardware

• When NGS machines started appearing on campus, there were 14 “high-performance” computing centers
• Most were small group- or department-specific clusters (<100 cores) with limited storage and standard (GigE) networking
• Larger computing resources existed at the Pittsburgh Supercomputing Center, but access to them was limited

Center for Simulation and Modeling (SAM)

• One large (>3000 cores) cluster existed on campus - established by computational chemistry and engineering groups
• Large-capacity machines (>12 cores/48 GB RAM per node - many with 48 cores/128-256 GB RAM)
• This cleared up the RAM and capacity problems

Common Problems - (2) Storage

• Despite early successes with the SAM cluster, problems started appearing as the number of users grew
• Storage array - the SAM cluster uses a shared NFS array for /home - reading and writing large files (read/quality files) became a serious bottleneck
• Upgraded the array to a high-performance system (Panasas) - allowed for parallel (DirectFS/pNFS) access and greater throughput than standard RAID

Common Problems - (3) Networking

• Networking within the SAM cluster is a combination of InfiniBand and gigabit Ethernet - not a problem
• Networking on campus was a problem
• Old network segments (100 Mbit/s), firewalls (multiple hops) - maximum transfer speeds only 10-15 Mbit/s
• Upgrades “in progress”

Common Problems - (3) Networking

• Solutions
• “Sneaker-net” - works, but leads to proliferation of drives, potential for data loss/corruption
• Globus/GridFTP - faster than campus network (transfer speeds of 1-2 Gbit/s)
• Cloud-based services (SevenBridges) - surprisingly economical and efficient - sequencing centers upload data for individual groups, who then use the data for analysis (online or at local cluster) and for backup (desktops)
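
To put the quoted transfer speeds in perspective, here is a rough back-of-the-envelope sketch. The 100 GB dataset size is a hypothetical figure chosen for illustration (it is not from the talk), and protocol overhead is ignored:

```python
def transfer_hours(size_gb: float, rate_mbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mbit_per_s megabits/second."""
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return size_megabits / rate_mbit_per_s / 3600


DATASET_GB = 100  # hypothetical dataset size, not a figure from the talk

for label, rate in [
    ("campus network, 10 Mbit/s", 10),
    ("campus network, 15 Mbit/s", 15),
    ("Globus/GridFTP, 1 Gbit/s", 1000),
    ("Globus/GridFTP, 2 Gbit/s", 2000),
]:
    print(f"{label}: {transfer_hours(DATASET_GB, rate):.2f} h")
```

Under these assumptions the campus network needs most of a day for a transfer that Globus/GridFTP finishes in minutes, which is why groups fell back on carrying drives.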

Common Problems - (4) Software

• Pipelines were created to link together common tools (BWA/NovoAlign/GATK/Annovar) - but these require familiarity with the command line/Unix environment
• With increasing use of NGS by medical/clinical research groups, we had more and more users who were not comfortable in a Unix environment
• Solutions: train users or develop non-Unix-based interfaces
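
For readers unfamiliar with what such a pipeline involves, here is a minimal sketch of the kind of command chain users had to manage. It is not the pipeline used at Pitt: the file names are hypothetical, samtools is an added assumption (it is not named in the talk), the GATK invocation uses GATK4-style syntax, and annotation (Annovar) is left out. The function only builds the command strings and runs nothing:

```python
import shlex


def build_pipeline(ref, fq1, fq2, sample):
    """Return shell commands for a minimal align-and-call run (nothing is executed).

    Assumes the reference is already indexed for bwa and GATK.
    """
    q = shlex.quote
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    # GATK needs a read group in the BAM; bwa expands the literal \t itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # Align paired-end reads and sort the output in one step
        f"bwa mem -R {q(read_group)} {q(ref)} {q(fq1)} {q(fq2)} "
        f"| samtools sort -o {q(bam)} -",
        # Index the sorted BAM so the variant caller can seek within it
        f"samtools index {q(bam)}",
        # Call variants (GATK4-style command line)
        f"gatk HaplotypeCaller -R {q(ref)} -I {q(bam)} -O {q(vcf)}",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("ref.fa", "reads_1.fq", "reads_2.fq", "sample1"):
        print(cmd)
```

Even this stripped-down version assumes comfort with pipes, quoting, and index files, which illustrates why a web or client interface is attractive for clinical users.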

Research Gateways

• Several bioinformatics/NGS gateways are being implemented
• Each allows access to the computational resources of the SAM cluster using web-based or client-based interfaces

Galaxy

CLCbio Genomics Server

[Figure labels: CLCbio Genomics Workbench, CLCbio Genomics Server]

Genboree

Issues with research gateways

• Common data storage and data dedup
• A current focus is configuring all NGS gateways to share the same common storage space and files, so we do not need to duplicate data across multiple storage spaces
• CLC bio Genomics Server “plays well with others”, as does Yabi, but Galaxy and Genboree do a lot of file permission modifications
• Solution: create a meta-data store that ensures files are owned by the appropriate users and have appropriate permissions - coupled with cron tasks to monitor user/permission changes
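
The meta-data store plus cron check described above could look roughly like the following sketch. It assumes the store can be reduced to a mapping from file path to expected owner uid and permission mode; that layout and the `audit` function name are illustrative assumptions, not the design used at Pitt:

```python
import os
import stat
import tempfile


def audit(expected):
    """Compare files on disk against expected {path: (uid, mode)} records.

    Returns a list of (path, problem) tuples; an empty list means all match.
    """
    problems = []
    for path, (uid, mode) in expected.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            problems.append((path, "missing"))
            continue
        if st.st_uid != uid:
            problems.append((path, f"owner uid {st.st_uid}, expected {uid}"))
        actual = stat.S_IMODE(st.st_mode)  # permission bits only
        if actual != mode:
            problems.append((path, f"mode {actual:o}, expected {mode:o}"))
    return problems


if __name__ == "__main__":
    # Demo on a throwaway file: record its state, then simulate a gateway
    # loosening the permissions behind our back.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        demo_path = handle.name
    os.chmod(demo_path, 0o640)
    record = {demo_path: (os.stat(demo_path).st_uid, 0o640)}
    os.chmod(demo_path, 0o666)
    for path, problem in audit(record):
        print(f"{path}: {problem}")
    os.remove(demo_path)
```

A cron job would run a check like this on a schedule and either report or repair the mismatches; which of the two is appropriate depends on how each gateway reacts to having permissions reset.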

Cloud Computing

• Another alternative: Cloud computing/hybrid solutions
(“Cloud-bursting”)
• Currently setting up cloud-based storage/staging of NGS
data to circumvent networking issues on campus
• Natural extension to allow users to analyze data “in the
cloud”
• Similar offerings from several companies - we’re working
with SevenBridges Genomics - nice billing interface for each
individual use (storage/staging/analysis)

Seven Bridges Genomics

NGS clinical process

[Figure: flow diagram of the NGS clinical process. Boxes and labels shown: Patient/Family consented; UPMC; Pitt; Samples drawn; DNA/library prep; Sequencing; QC/filtering; Alignment; Variant Calling; Annotation; Validation; Interpretation; Analysis; Medical Report; EMR; Data Storage (BAM, VCF files); Data Storage (Raw, BAM, VCF)]

The last analysis challenge

• Even after fixing all of these issues, two major hurdles remain
• Community
• Organizing and coordinating all NGS efforts on campus would greatly speed up the pace of research
• Education!
• We need to educate clinicians and clinical-support staff (genetic counselors) to understand the limitations and the advantages of sequence data from the perspective of clinical utility

Thanks!