
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN

Masahito Ohue, Marina Yamasawa, Kazuki Izawa, Yutaka Akiyama: Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN.
In Proceedings of the 19th Annual IEEE International Conference on Bioinformatics and Bioengineering (IEEE BIBE 2019), pp. 152-156, 2019. doi: 10.1109/BIBE.2019.00035


  1. 1. Parallelized Pipeline for Whole Genome Shotgun Metagenomics with GHOSTZ-GPU and MEGAN (DAY-3) Oct 29, 2019 B4 - Bioinformatics Session 4 (Sequence) Royal Olympic Hotel, Athens, GREECE Masahito Ohue1 Marina Yamasawa1,2 Kazuki Izawa1 Yutaka Akiyama1 1. Department of Computer Science, School of Computing, Tokyo Institute of Technology, JAPAN 2. Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), JAPAN Paper-ID 228
  2. 2. Agenda • Introduction – Metagenome – 16S rRNA vs. whole genome shotgun (WGS) metagenomics – Homology search, GHOSTZ-GPU – WGS metagenome workflow • GHOSTMEGAN Pipeline • Computational Experiments • Results and Discussion • Conclusion 1
  3. 3. Introduction 2
  4. 4. Metagenome Analysis • Directly sequencing uncultured microbiomes obtained from a target environment and analyzing the sequence data – Finding novel genes from unculturable microorganisms – Elucidating the composition of species/genes in environments (examples of microbiomes: human body, gut, oral, sea, soil) 3
  5. 5. Example metagenome projects: Home Microbiome Study, Hospital Microbiome Project, Earth Microbiome Project, Marine Phage Sequencing Project, National Metagenomic Project 4
  6. 6. 16S rRNA Metagenomics vs. WGS Metagenomics 5
      16S rRNA Sequencing: analyzes DNA from amplicon sequencing of prokaryotic 16S small subunit ribosomal RNA genes. ✓ Provides visuals of taxonomic classification ✓ Low cost × Cannot search for functional genes
      Whole Genome Shotgun (WGS) Sequencing: analyzes the untargeted ('shotgun') sequencing of all ('meta-') microbial genomes present in a sample. ✓ Provides visuals of taxonomic classification and functional genes × More costly
  7. 7. (Figure) (16S) Taxonomic Composition (Periodontal diseases): relative composition (0-100%) for samples H1-H10 and P1-P6 (Izawa K, et al., unpublished work) 6
  8. 8. (Figure) (WGS) Functional Gene Category Composition: relative composition (0-100%) for samples H1-H10 and P1-P6 (Izawa K, et al., unpublished work) 7
  9. 9. 16S Analysis Workflow 8 (example: USEARCH mapping + QIIME summarization) Baichoo S, et al. BMC Bioinformatics, 19(1):457, 2018.
  10. 10. WGS Metagenome Analysis Flow 9 (example: homology search + summarization) (figure: short query reads are searched against a reference database of known sequences, e.g. Escherichia coli, Daphnia pulex) • Smith-Waterman? BLAST? Too slow!
  11. 11. Rough Comparison of Homology Search Tools 10
      Tool (reference)                           | Sensitivity        | Speed ratio                                   | GPU
      BLAST (Altschul 1990)                      | ✓ best             | 1                                             | △
      BLAT (Kent 2002)                           | ×                  | 50                                            | ×
      RAPSearch ver. 2.12 (Ye 2011, Zhao 2012)   | ✓ / △ (fast mode)  | 100 / 1,600 (fast mode)                       | ×
      DIAMOND ver. 0.7.9 (Buchfink 2015)         | △ / × (fast mode)  | 1,000 / 3,000 (fast mode)                     | ×
      GHOSTZ (Suzuki 2015)                       | ✓                  | 400                                           | ×
      GHOSTZ-GPU (Suzuki 2016)                   | ✓                  | 1,500 (1 GPU), 2,000 (2 GPUs), 2,500 (3 GPUs) | ✓
  12. 12. GHOSTZ Algorithm 11 (figure comparing the two pipelines)
      BLAST: database + query sequences → seed search for k-mer (neighborhood words) substring matches using a finite automaton → gapless extension → gapped extension → results.
      GHOSTZ: database subsequence clustering + hash table → seed search with distance calculation using cluster representatives → gapless extension → gapped extension → results.
      Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics, 31(8), 1183–1190, 2015.
  13. 13. Homology Search Accuracy (Sensitivity) 12 (figure: sensitivity of RAPSearch and GHOSTZ/GHOSTZ-GPU on the marine sample ERR315856 (Tara Oceans marine microbiome) against the KEGG GENES DB) Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
  14. 14. GHOSTZ/GHOSTZ-GPU Calculation Speed 13 (figure: computation time in seconds for SRR407548 (soil) + SRS011098 (oral) + ERR315856 (marine) against the KEGG GENES DB; 1,000,000 randomly selected DNA reads from each dataset; CPU: 12 CPU threads, Xeon 5670, 2.93 GHz; GPU: Tesla K20X) Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
  15. 15. Summarization Tool: MEGAN 14 • Calculates the relative ratios of OTUs and gene functions from the output of BLAST or GHOSTZ-GPU • * MEGAN itself is a pipeline tool (based on DIAMOND) Huson DH, et al. PLoS Comput Biol. (2016)
  16. 16. WGS Metagenomics Pipelines 15 • MetaWRAP (Uritskiy GV, et al. Microbiome, 6(1), 158, 2018) – Does not handle homology searches; rather, it performs metagenome assembly – Does not support multi-node parallelization • MEGAN (Huson DH, et al. PLoS Comput Biol, 12: e1004957, 2016) – Uses DIAMOND, which does not support GPU acceleration – Thus high-speed analysis is not possible • MiGAP (Sugawara H, et al. Genome Inform, 2009) – Uses BLAST – Thus it also cannot perform high-speed analysis
  17. 17. WGS Metagenome Analysis Flow 16 • e.g., analysis of 100 million reads (150 bp) against the KEGG GENES DB (1.3 million seqs); a sequencer outputs 20 billion reads per 2 days (Illumina NovaSeq 6000) • The homology search alone takes ~2,500 days with BLAST on a normal laptop PC, ~18 hrs on a 28-core workstation, or ~6 hrs on a 28-core & 4-GPU workstation • Further speedup is needed! ▶ multi-node parallel computing
  18. 18. Purpose of This Study • Develop a new WGS metagenome analysis system, GHOSTMEGAN – Pipeline the sequence homology search and the post-processing – Perform distributed computation on parallel computers • Link GHOSTZ-GPU and MEGAN – GHOSTZ-GPU is the fastest sequence homology search tool that supports multi-GPU computation • Performance evaluation – Evaluate by parallel execution on a multi-node GPU cluster using an actual WGS metagenome dataset 17
  19. 19. GHOSTMEGAN Pipeline 18
  20. 20. Overview • Simple workflow • Focused on cluster machines (multi-GPU × multi-node supercomputers) 19
  21. 21. GHOSTMEGAN Pipeline on Cluster System 20
      Query (fasta file) → (A) Dividing query: fasta.1, fasta.2, …, fasta.n (one per node)
      → (B) Sequence homology search by GHOSTZ-GPU on each node: tsv.1, tsv.2, …, tsv.n
      → (C) Analyzing by MEGAN on each node: rma.1, rma.2, …, rma.n
      → (D) Integrating results (concatenating the rma files on a single node) → Results (rma file)
  22. 22. (A) Dividing Query 21 (figure: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n on n nodes) • The input file (query) for WGS metagenome analysis is a huge single fasta file • The query file is divided into n files for n compute nodes • The processing time for dividing queries is extremely small compared with the other steps. A minimal splitting sketch follows below.
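      A minimal shell sketch of step (A), assuming a standard multi-FASTA query and the fasta.1 … fasta.n naming used in the pipeline figure; the file names and chunk count are illustrative, and any splitter (e.g., seqkit split) would work equally well:

        # Round-robin split of query.fasta into n chunks (query.fasta.1 ... query.fasta.n).
        n=8
        awk -v n="$n" '
          /^>/ { out = "query.fasta." ((r++ % n) + 1) }   # start of a new read: pick a chunk
          { print > out }                                 # write the header and sequence lines
        ' query.fasta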
  23. 23. (B) Sequence Homology Search by GHOSTZ-GPU 22 (figure: fasta.1 … fasta.n → GHOSTZ-GPU on each of the n nodes → tsv.1 … tsv.n) • GHOSTZ-GPU is executed on each node for its divided query file – Thread-parallel computation using all CPU/GPU resources of the node • The genome DB is stored on the local storage of every node • The output of GHOSTZ-GPU is a tab-delimited BLAST-format file – Only hits with E-value < 10^-5 are passed to the next step. A per-node sketch follows below.
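      A sketch of step (B) on one node, using the GHOSTZ-GPU settings shown later on slide 29; the database path, output path (-o), chunk naming, and the awk filter are illustrative assumptions for this sketch:

        DB=/path/to/ghostz_db                 # pre-formatted GHOSTZ-GPU database (e.g., NCBI nr)
        i=1                                   # index of the query chunk assigned to this node
        ghostz-gpu aln -d "$DB" -b 1 -q d -a 1 -g 3 \
                       -i "query.fasta.$i" -o "hits.$i.tsv"
        # keep only alignments with E-value < 10^-5 (column 11 of the BLAST tabular format)
        awk '$11 < 1e-5' "hits.$i.tsv" > "hits.filtered.$i.tsv"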
  24. 24. (C) Analyzing by MEGAN 23 (figure: tsv.1 … tsv.n → MEGAN blast2rma on each of the n nodes → rma.1 … rma.n) • The MEGAN blast2rma command is performed on each node (using CPUs only) • The computation is performed independently for each read's search result, so the rma output is not affected by how the queries were divided. A per-node sketch follows below.
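      A sketch of step (C) on one node, using the blast2rma invocation shown on slide 29; the chunk naming is illustrative, and depending on the MEGAN setup, taxonomy/function mapping files may also need to be supplied:

        i=1   # same chunk index as in step (B)
        blast2rma --in "hits.filtered.$i.tsv" --out "result.$i.rma" --format BlastTab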
  25. 25. (D) Integrating Results 24 (figure: rma.1 … rma.n → MEGAN compute-comparison → MEGAN extract-biome → Results (rma file)) • After all MEGAN blast2rma processes finish, the MEGAN compute-comparison command is run – it integrates the multiple analysis results into a single file • Then MEGAN extract-biome is used to summarize the whole result. A sketch follows below.
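      A sketch of step (D); the tool names come from this slide, but the -i/-o options and file names are assumptions, so the exact flags should be checked against the MEGAN tools documentation:

        # integrate the per-node rma files into one comparison file, then summarize it
        compute-comparison -i result.*.rma -o comparison.rma   # flags assumed, not from the slides
        extract-biome -i comparison.rma -o summary.rma         # flags assumed, not from the slides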
  26. 26. GHOSTMEGAN Pipeline on Cluster System 25 • To ensure usability, only one parameter file needs to be edited to run the whole GHOSTMEGAN pipeline. An illustrative example follows below.
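      A hypothetical example of what such a parameter file might contain; the key names and values below are illustrative assumptions, not the actual GHOSTMEGAN file format:

        QUERY=/path/to/query.fasta          # input WGS reads (single fasta file)
        DB=/path/to/ghostz_db               # pre-formatted GHOSTZ-GPU database (e.g., NCBI nr)
        OUTDIR=/path/to/results
        N_NODES=32                          # number of query divisions = number of compute nodes
        GHOSTZ_OPTS="-b 1 -q d -a 1 -g 3"   # GHOSTZ-GPU options (cf. slide 29)
        EVALUE_CUTOFF=1e-5                  # hits above this E-value are discarded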
  27. 27. Experimental Settings 26
  28. 28. Hardware Specification 27
      TSUBAME 3.0 compute node specification (f_node):
        CPU: Intel Xeon E5-2680 v4 (2.4 GHz, 14 cores) × 2
        GPU: NVIDIA Tesla P100 NVLink (16 GB) × 4
        RAM: 256 GiB
        Local storage: Intel SSD DC P3500 (2 TB)
        Network: Intel Omni-Path 100 Gb/s × 4
        Job scheduler: Univa Grid Engine 8.5.4
      TSUBAME 3.0: 25th-ranked supercomputer (Top500, 8.1 petaflops, Jun 2019), 15,120 CPU cores, 2,160 NVIDIA P100 GPUs.
      We ran GHOSTMEGAN with n nodes in parallel, using n = 1, 2, 4, 8, 16, 32, 64, and 128 as the query division number, and compared the execution times and speedup rates. A job-submission sketch follows below.
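      A sketch of how the n per-node jobs might be submitted with Univa Grid Engine; the f_node resource name and the -g group option follow TSUBAME 3.0 conventions, and the script name run_node.sh is hypothetical:

        GROUP=your_tsubame_group                # accounting group (site-specific)
        N=32                                    # number of nodes / query chunks for this run
        for i in $(seq 1 "$N"); do
          qsub -g "$GROUP" -l f_node=1 -l h_rt=24:00:00 \
               -N "ghostmegan.$i" run_node.sh "$i"   # run_node.sh does steps (B) and (C) for chunk $i
        done
        # once every per-node job has finished, run the integration step (D) once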
  29. 29. Software and Dataset 28
      Homology search: GHOSTZ-GPU ver. 1.1.0 (https://github.com/akiyamalab/ghostz-gpu)
        $ ghostz-gpu aln -d [DB] -b 1 -q d -a 1 -g 3 -i [query]
      Post-processing: MEGAN ver. 6.12.6 (http://megan.informatik.uni-tuebingen.de)
        $ blast2rma --in [GHOSTZ output] --out [MEGAN rma file] --format BlastTab
      Query sequences: human oral WGS metagenome reads (Duran-Pinedo AE, et al. ISME J, 8(8), 1659–1672, 2014); a random sample of 1,000,000 reads (100 bp) from periodontally healthy individual samples (145 MB)
      Database: NCBI nr, 166,109,435 seqs (101 GB), ftp://ftp.ncbi.nih.gov/blast/db/ (accessed August 18, 2018)
  30. 30. Results and Discussion 29
  31. 31. (1) Overall Pipeline Execution Time 30 (figure: execution time vs. number of nodes; ~15 hours on 1 node, decreasing to roughly 33 min on 32 nodes, 24 min on 64 nodes, and 20 min on 128 nodes) • The maximum acceleration was ~45-fold (on 128 nodes) • GHOSTZ-GPU was so fast that its calculation time saturated at large node counts
  32. 32. (2) Parallel Efficiency (Scalability) 31 strong scaling = (speedup on n nodes relative to 1 node) / n • Measured strong scaling was about 0.98 and 0.93 at small node counts, then 0.87 (32 nodes), 0.60 (64 nodes), and 0.35 (128 nodes) • Linear speed improvement was obtained from 1 to 32 nodes (strong scaling = 0.87). A worked example follows below.
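      As a worked check using the numbers from the previous two slides: one node takes about 15 hours (900 min) and 128 nodes take about 20 min, giving a speedup of 900 / 20 = 45, so strong scaling = 45 / 128 ≈ 0.35, which matches the value reported for 128 nodes.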
  33. 33. Summary of the Results • MEGAN scaling was good • GHOSTZ-GPU scaling decreased at n > 32 – The query data was small – We expect high efficiency for larger queries • Measuring a larger query was difficult this time because the n = 1 baseline also had to be measured – For larger queries, strong scaling could instead be measured against, for example, n = 8 • MEGAN, which has no GPU implementation, has room for acceleration – To cope with increasing query sizes, steps other than the homology search also need to be accelerated with GPUs 32
  34. 34. Homology Search Results 33 • In the parallel computing of GHOSTMEGAN on this dataset, only two reads out of 1 million had homology search results that differed between computing on 1 node and on 2 nodes • One read was assigned XP 025968818.1 (LOW QUALITY PROTEIN: tigger transposable element-derived protein 1-like [Dromaius novaehollandiae]) in one run and XP 019376199.1 (PREDICTED: tigger transposable element-derived protein 1-like, partial [Gavialis gangeticus]) in the other – ✓ the "tigger transposable element-derived protein 1-like" gene is widely conserved – ✓ the result did not affect the WGS metagenome analysis • The other read was assigned BAD18412.1 (unnamed protein product [Homo sapiens]) in one run and EHH57573.1 (hypothetical protein EGM 07242, partial [Macaca fascicularis]) in the other – ✓ both were annotated as function-unknown genes – ✓ the result also did not affect the metagenome analysis at this time
  35. 35. Conclusion 34
  36. 36. Conclusion • The GHOSTMEGAN pipeline was developed and evaluated to achieve large-scale metagenomic analysis – The homology search and the other processing steps were parallelized – Executed on the TSUBAME 3.0 supercomputer with multiple GPUs • GHOSTMEGAN achieved parallel computing on multiple compute nodes – Obtained linear speedup up to 32 nodes – 45-times faster calculation on 128 nodes • GPU acceleration of MEGAN or other tools will be crucial – GHOSTZ-GPU was significantly accelerated on multiple GPUs – To prepare for further increases in data size in the future 35
  37. 37. Acknowledgments 36 Funding Akiyama Lab. Tokyo Tech, JAPAN
