Pitt's Barriers to Centralizing Clinical NGS Data Analysis
1. Computing Infrastructures for Clinical NGS
Barriers to centralizing data analysis at the University of Pittsburgh
M. Michael Barmada
Department of Human Genetics
Graduate School of Public Health, University of Pittsburgh
2. University of Pittsburgh
• Geographically dispersed campus (132 acres in Pittsburgh, plus
regional campuses)
• Large affiliated hospital system (UPMC - 23 hospitals spread
out over tri-state area)
• Has been ranked in the top cluster of research institutions in
the US
• 7th in the nation in NIH funding
3. NGS at Pitt
4. Common Analysis hurdles for NGS
• Hardware
• Computing capacity
• Network
• Storage
• Software
5. Common problems - (1) Hardware
• When NGS machines started appearing on campus, there
were 14 “high-performance” computing centers
• Most were small group- or department-specific clusters
(<100 cores) with limited storage and standard (GigE)
networking
• Larger computing resources were available at the Pittsburgh
Supercomputing Center, but with limited availability
6. Center for Simulation and Modeling (SAM)
• One large (>3000 cores) cluster existed on campus -
established by computational chemistry and engineering
groups
• Large-capacity machines (>12 cores/48 GB RAM per node -
many with 48 cores/128-256 GB RAM)
• This cleared up the RAM and capacity problems
7. Common Problems - (2) Storage
• Despite early successes with SAM cluster, problems started
appearing as number of users went up
• Storage array - SAM cluster uses a shared NFS array for
/home - reading and writing large files (read/quality files)
became a serious bottleneck
• Upgraded array to high-performance system (Panasas) -
allowed for parallel (DirectFS/pNFS) access, greater
throughput than standard RAID
8. Common Problems - (3) Networking
• Networking within the SAM cluster is a combination of
InfiniBand and gigabit Ethernet - not a problem
• Networking on campus was a problem
• Old network segments (100 Mbit/s), firewalls (multiple hops)
- maximum transfer speeds only 10-15 Mbit/s
• Upgrades “in progress”
9. Common Problems - (3) Networking
• Solutions
• “Sneaker-net” - works, but leads to proliferation of drives,
potential for data loss/corruption
• Globus/GridFTP - faster than campus network (transfer
speeds of 1-2 Gbit/s)
• Cloud-based services (Seven Bridges) - surprisingly
economical and efficient - sequencing centers upload data
for individual groups, who then use the data for analysis
(online or at local cluster) and for backup (desktops)
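To put the quoted transfer speeds in perspective, the following is a rough back-of-the-envelope sketch. The 100 GB dataset size is a hypothetical figure chosen for illustration and does not come from the slides; only the link rates (10-15 Mbit/s on the campus network, 1-2 Gbit/s with Globus/GridFTP) are taken from the talk. Protocol overhead and contention are ignored.

```python
def transfer_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Idealized transfer time: payload bits divided by link rate."""
    return size_bytes * 8 / rate_bits_per_s


# Hypothetical dataset size (assumption, not from the slides): 100 GB.
DATASET_BYTES = 100e9

# Link rates quoted on the networking slides, in bits per second.
RATES = {
    "campus network, low (10 Mbit/s)": 10e6,
    "campus network, high (15 Mbit/s)": 15e6,
    "Globus/GridFTP, low (1 Gbit/s)": 1e9,
    "Globus/GridFTP, high (2 Gbit/s)": 2e9,
}

if __name__ == "__main__":
    for label, rate in RATES.items():
        hours = transfer_seconds(DATASET_BYTES, rate) / 3600
        print(f"{label}: {hours:.2f} h")
```

Under these assumptions the same dataset needs roughly 15 to 22 hours over the campus network and roughly 7 to 13 minutes over Globus/GridFTP, which is the gap that made "sneaker-net" attractive in the first place.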
10. Common Problems - (4) Software
• Pipelines created for linking together common tools
(BWA/Novoalign/GATK/ANNOVAR) - but these require familiarity with
command line/unix environment
• With increasing use of NGS by medical/clinical research
groups, we had more and more users who were not
comfortable in a unix environment
• Solutions: train users or develop non-unix-based interfaces
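As an illustration of what such a pipeline asks of a user, here is a minimal sketch that only builds and prints the command lines for an align, sort, index, call sequence; it does not run anything. It is not the pipeline used at Pitt. The use of samtools, the GATK4-style `gatk HaplotypeCaller` invocation, the file naming scheme, and the `Step`, `build_pipeline`, and `render` names are all assumptions made for this example.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str
    argv: List[str]
    stdout: Optional[str] = None  # file that receives stdout, if any


def build_pipeline(ref: str, fq1: str, fq2: str, sample: str) -> List[Step]:
    """Build (but do not run) a hypothetical align -> sort -> index -> call chain."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # bwa mem writes SAM to stdout, so it is redirected to a file.
        Step("align", ["bwa", "mem", ref, fq1, fq2], stdout=sam),
        Step("sort", ["samtools", "sort", "-o", bam, sam]),
        Step("index", ["samtools", "index", bam]),
        # Assumes a GATK4-style command line; a real run also needs
        # read groups in the BAM and an indexed reference.
        Step("call", ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf]),
    ]


def render(steps: List[Step]) -> List[str]:
    """Render each step as a shell command line a user would have to type."""
    lines = []
    for step in steps:
        cmd = " ".join(shlex.quote(arg) for arg in step.argv)
        if step.stdout:
            cmd += " > " + shlex.quote(step.stdout)
        lines.append(cmd)
    return lines


if __name__ == "__main__":
    for line in render(build_pipeline("ref.fa", "s1_1.fq", "s1_2.fq", "s1")):
        print(line)
```

Even this four-step sketch presumes comfort with redirection, file naming, and tool-specific flags, which is the barrier the slide describes for clinical users.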
11. Research Gateways
• Several Bioinformatics/NGS gateways are in the process of
being implemented
• Each allows access to the computational resources of the SAM
cluster using web-based or client-based interfaces
12. Galaxy
13. CLC bio Genomics Server
14. CLC bio Genomics Server
[Diagram labels: CLC bio Genomics Workbench, CLC bio Genomics Server]
15. Genboree
17. Issues with research gateways
• Common data storage and data dedup
• A current focus is configuring all NGS gateways so that they
share the same common storage space and files, so we
do not need to duplicate data across multiple storage spaces
• CLC bio Genomics Server “plays well with others”, as does
Yabi, but Galaxy and Genboree do a lot of file permission
modifications
• Solution: create a metadata store that ensures files are
owned by the appropriate users and have appropriate
permissions - coupled with cron tasks to monitor
user/permission changes
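The metadata-store idea can be sketched as follows. This is a minimal illustration rather than the implementation used on the SAM cluster: the in-memory dictionary standing in for the metadata store, and the `Expected`, `find_drift`, and `restore_modes` names, are hypothetical. A cron task could call something like this periodically to detect files whose owner or permission bits were changed by a gateway.

```python
import os
import stat
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Expected:
    uid: int   # numeric user id that should own the file
    mode: int  # expected permission bits, e.g. 0o640


def find_drift(store: Dict[str, Expected]) -> List[str]:
    """Return paths whose owner or permission bits differ from the store."""
    drifted = []
    for path, want in store.items():
        st = os.stat(path)
        have_mode = stat.S_IMODE(st.st_mode)  # strip file-type bits
        if st.st_uid != want.uid or have_mode != want.mode:
            drifted.append(path)
    return drifted


def restore_modes(store: Dict[str, Expected], paths: List[str]) -> None:
    """Reset permission bits to the recorded values.

    Ownership is deliberately not changed here: os.chown generally
    requires elevated privileges, so that step is left to the operator.
    """
    for path in paths:
        os.chmod(path, store[path].mode)
```

A real deployment would presumably keep the expected values in a database and log each change rather than silently reverting it, so that the gateway responsible can be identified.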
18. Cloud Computing
• Another alternative: Cloud computing/hybrid solutions
(“Cloud-bursting”)
• Currently setting up cloud-based storage/staging of NGS
data to circumvent networking issues on campus
• Natural extension to allow users to analyze data “in the
cloud”
• Similar offerings from several companies - we're working
with Seven Bridges Genomics - nice billing interface for each
individual use (storage/staging/analysis)
19. Seven Bridges Genomics
20. NGS clinical process
[Flow diagram with three columns: Patient/Family, UPMC, Pitt. Labels shown, in approximate order: consented; samples drawn; DNA/library prep; sequencing; QC/filtering; alignment; variant calling; annotation; validation; interpretation; analysis; medical report; EMR. Data storage appears twice: one box for BAM and VCF files, one for raw, BAM, and VCF files.]
21. The last analysis challenge
• Even after fixing all of these issues, two major hurdles remain
• Community
• Organizing and coordinating all NGS efforts on campus
would greatly speed up the pace of research
• Education!
• We need to educate clinicians and clinical-support staff
(genetic counselors) to understand the limitations and
the advantages of sequence data from the perspective of
clinical utility
22. Thanks!
Editor's Notes
Some quick facts about the University of Pittsburgh: we are very spread out, as is common in large academic/medical centers. UPMC in particular is one of the largest hospital systems in the US, serving a population of approximately 15 million people in what we call the tri-state area, with major emphasis in GI, transplant, cancer, and aging centers. Given the large amount of research funding brought in by University and UPMC researchers, you can imagine we are a very active research center.
This map shows the location of NGS machines at Pitt that have shown up in the last three years - a total of 11 machines of all types (454, Illumina, SOLiD, Ion Torrent). The lack of centralization of these resources has hurt efforts to get good (reproducible) data from these centers.