Parallelized Pipeline
for Whole Genome Shotgun Metagenomics
with GHOSTZ-GPU and MEGAN
(DAY-3) Oct 29, 2019
B4 - Bioinformatics Session 4 (Sequence)
Royal Olympic Hotel, Athens, GREECE
Masahito Ohue (1), Marina Yamasawa (1, 2), Kazuki Izawa (1), Yutaka Akiyama (1)
1. Department of Computer Science, School of Computing,
Tokyo Institute of Technology, JAPAN
2. Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL),
National Institute of Advanced Industrial Science and Technology (AIST), JAPAN
Paper-ID 228
Agenda
• Introduction
– Metagenome
– 16S rRNA vs. whole genome shotgun (WGS) metagenomics
– Homology search, GHOSTZ-GPU
– WGS metagenome workflow
• GHOSTMEGAN Pipeline
• Computational Experiments
• Results and Discussion
• Conclusion
Introduction
Metagenome Analysis
• Directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data
– Finding novel genes from unculturable microorganisms
– Elucidating the composition of species/genes in environments
Examples of microbiome: Human body, Gut, Oral, Sea, Soil
Home Microbiome Study Hospital Microbiome Project
Earth Microbiome Project Marine Phage Sequencing Project
National Metagenomic Project
16S rRNA Metagenomics vs. WGS Metagenomics
16S rRNA Sequencing
Analyzes DNA from amplicon sequencing of prokaryotic 16S small subunit ribosomal RNA genes.
✓ Provides visuals of taxonomic classification
✓ Low cost
× Cannot search for functional genes
Whole Genome Shotgun (WGS) Sequencing
Analyzes the untargeted ('shotgun') sequencing of all ('meta-') microbial genomes present in a sample.
✓ Provides visuals of taxonomic classification and functional genes
× More costly
[Figure: (16S) Taxonomic Composition (Periodontal diseases); samples H1 to H10 and P1 to P6, vertical axis 0% to 100%]
(Izawa K, et al. unpublished work)
[Figure: (WGS) Functional Gene Category Composition; samples H1 to H10 and P1 to P6, vertical axis 0% to 100%]
(Izawa K, et al. unpublished work)
16S Analysis Workflow
(example) USEARCH mapping + Qiime summarization
Baichoo S, et al. BMC Bioinformatics, 19(1):457, 2018.
WGS Metagenome Analysis Flow
(example) homology search + summarization
[Figure: query reads (AATCG, GCACA) searched against a database (Escherichia coli, Daphnia pulex; ATGCGAAATCGCTA…, CGGCTCAGCGATCG…)]
Smith-Waterman? BLAST? Too slow!!
Rough Comparison of Homology Search Tools
Tool (reference) | Sensitivity | Speed ratio | GPU
BLAST (Altschul 1990) | ✓ (best) | (1) | △
BLAT (Kent 2002) | × | 50 | ×
RAPSearch ver. 2.12 (Ye 2011, Zhao 2012) | ✓ / △ (fast) | 100 / 1,600 (fast) | ×
DIAMOND ver. 0.7.9 (Buchfink 2015) | △ / × (fast) | 1,000 / 3,000 (fast) | ×
GHOSTZ (Suzuki 2015) | ✓ | 400 | ×
GHOSTZ-GPU (Suzuki 2016) | ✓ | 1,500 (1 GPU) / 2,000 (2 GPUs) / 2,500 (3 GPUs) | ✓
GHOSTZ Algorithm
[Figure: BLAST and GHOSTZ search flows, side by side]
BLAST: database and query sequences; K-mer (neighborhood words); finite automaton; seed search (search K-mer substring match by using finite automaton); gapless extension; gapped extension; results
GHOSTZ: database and query sequences; subsequence clustering; hash table; seed search (distance calculation using cluster representatives); gapless extension; gapped extension; results
Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics, 31(8), 1183–1190, 2015.
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
GHOSTZ/GHOSTZ-GPU Sensitivity
[Figure: Homology search accuracy (sensitivity), marine sample; ERR315856 (Marine Microbiome, Tara Oceans) against KEGG GENES DB; compared with RAPSearch]
GHOSTZ/GHOSTZ-GPU Calculation Speed
[Figure: computation time (sec.) per homology search tool, vertical axis 0 to 12,000; bar values as extracted: 41,236; 2,644; 9,970; 2,794; 1,885; 1,502; 3,717; 1,034]
Queries: SRR407548 (Soil) + SRS011098 (Oral) + ERR315856 (Marine), 1,000,000 randomly selected DNA reads from each dataset, against KEGG GENES DB
CPU: 12 CPU threads, Xeon5670, 2.93GHz
GPU: Tesla K20X
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
Summarization Tool
Calculate the relative ratio of OTU and gene function using the output
of BLAST and GHOSTZ-GPU
* MEGAN itself is a pipeline tool (based on DIAMOND)
Huson DH, et al. PLoS Comput Biol. (2016)
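The "relative ratio" mentioned above can be illustrated with a minimal Python sketch. This is not MEGAN's implementation; the function name `relative_ratios` and the toy labels are hypothetical, and the sketch only assumes that each read has already been assigned one label (an OTU or a functional category).

```python
from collections import Counter
from typing import Dict, Iterable


def relative_ratios(assignments: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of reads assigned to each label."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}  # no assigned reads, nothing to normalize
    return {label: count / total for label, count in counts.items()}


# Toy, made-up assignments for four reads (illustration only).
ratios = relative_ratios(["taxonA", "taxonA", "taxonB", "taxonC"])
print(ratios)
```

Stacked bar charts of composition per sample are essentially these ratios plotted for each sample.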
WGS Metagenomics Pipeline
• MetaWRAP
– Does not handle homology searches
• Preferably performs metagenome assembly
– Does not support multi-node parallelization
• MEGAN
– Uses DIAMOND
– DIAMOND does not support GPU acceleration
– Thus a high-speed analysis is not possible
• MiGAP
– Uses BLAST
– Thus also cannot perform high-speed analysis
Uritskiy GV, et al. Microbiome, 6(1), 158, 2018.
Huson DH, et al. PLoS Comput Biol, 12: e1004957, 2016.
Sugawara H, et al. Genome Inform, 2009.
WGS Metagenome Analysis Flow
e.g. analysis of 100-million reads (150 bp) (Database: KEGG GENES DB, 1.3-million seqs)
• BLAST: 2,500 days (on normal laptop PC)
• 18-hrs (on 28 cores workstation)
• 6-hrs (on 28 cores & 4 GPUs workstation)
Sequencer output: 20-billion reads per 2-days (Illumina NovaSeq 6000)
Further speedup is needed!
▶ multi-node parallel computing
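A back-of-the-envelope calculation shows why the single-workstation timings fall short of the sequencer's output. The sketch below takes the read counts and the 6-hr and 18-hr figures from the slide; the assumption that run time grows linearly with the number of reads is ours, and the function name `scaled_hours` is hypothetical.

```python
REFERENCE_READS = 100_000_000      # 100-million reads (from the slide)
SEQUENCER_READS = 20_000_000_000   # 20-billion reads per 2-day run (from the slide)


def scaled_hours(hours_for_reference: float, reads: int = SEQUENCER_READS) -> float:
    """Scale a measured run time to `reads`, assuming time is linear in read count."""
    return hours_for_reference * reads / REFERENCE_READS


gpu_workstation_days = scaled_hours(6) / 24    # 28 cores & 4 GPUs workstation
cpu_workstation_days = scaled_hours(18) / 24   # 28 cores workstation

print(f"GPU workstation: {gpu_workstation_days:.0f} days")
print(f"CPU workstation: {cpu_workstation_days:.0f} days")
```

Under that linear assumption, one two-day sequencing run would occupy the GPU workstation for about 50 days, so a single machine cannot keep pace with the sequencer.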
Purpose of This Study
• Developing a new WGS metagenome analysis system, GHOSTMEGAN
– Pipelines the sequence homology search and post-processing
– Performs distributed computation on parallel computers
• Linking GHOSTZ-GPU and MEGAN
– GHOSTZ-GPU is the fastest sequence homology search tool that supports multi-GPU computation
• Performance evaluation
– Evaluated using an actual WGS metagenome dataset by parallel execution on a multi-node GPU cluster
GHOSTMEGAN Pipeline
Overview
• Simple workflow
• Focused on cluster machines (multi-GPU × multi-node supercomputers)
GHOSTMEGAN Pipeline on Cluster System
[Figure: GHOSTMEGAN workflow]
(A) Dividing query: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n
(B) Sequence homology search by GHOSTZ-GPU: fasta.i → GHOSTZ-GPU → tsv.i
(C) Analyzing by MEGAN: tsv.i → MEGAN → rma.i
(D) Integrating results: rma.1, rma.2, …, rma.n → Concat rma → Results (rma file)
Steps (B) and (C) for each divided file run on a single node.
(A) Dividing Query
[Figure: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n (n nodes)]
• The input file (query) for WGS metagenome analysis is a single huge fasta file
• The query file is divided into n files for n compute nodes
• The processing time for dividing queries is extremely small compared with the other steps
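The division step can be sketched in Python as follows. The slides do not say how GHOSTMEGAN assigns reads to files, so the round-robin assignment here is an assumption, and the function names `read_fasta` and `divide_records` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def divide_records(records: Iterable[Record], n: int) -> List[List[Record]]:
    """Distribute records round-robin into n chunks (one chunk per compute node)."""
    chunks: List[List[Record]] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        chunks[i % n].append(record)
    return chunks


# Toy input (illustration only): three reads divided for two nodes.
sample = [">r1\n", "ACGT\n", ">r2\n", "GG\n", "TT\n", ">r3\n", "A\n"]
for index, chunk in enumerate(divide_records(read_fasta(sample), 2), start=1):
    print(f"fasta.{index}: {len(chunk)} reads")
```

Round-robin keeps the chunk sizes within one read of each other, which helps balance the per-node work when reads have similar lengths.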
(B) Sequence Homology Search by GHOSTZ-GPU
[Figure: fasta.1, fasta.2, …, fasta.n → GHOSTZ-GPU (on n nodes) → tsv.1, tsv.2, …, tsv.n]
• GHOSTZ-GPU is executed for each divided query file on one node
– Thread-parallel computation using all CPU/GPU resources
• The genome DB is stored in the local storage of every node
• The output of GHOSTZ-GPU is a tab-delimited BLAST format file
– Results with E-value < 10^-5 are provided to the next step
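The E-value cutoff can be sketched as a small filter over tab-delimited hits. The sketch assumes the common 12-column BLAST tabular layout, in which the E-value is the 11th field; whether GHOSTMEGAN applies the cutoff this way is not stated in the slides, and the function name `filter_hits` is hypothetical.

```python
from typing import Iterable, List

EVALUE_COLUMN = 10  # 0-based index; assumes the 12-column BLAST tabular layout


def filter_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> List[str]:
    """Keep only hits whose E-value is below max_evalue."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[EVALUE_COLUMN]) < max_evalue:
            kept.append(line.rstrip("\n"))
    return kept


# Toy hits (made-up values, illustration only).
hits = [
    "r1\tsubjA\t90.0\t33\t3\t0\t1\t99\t10\t42\t1e-20\t80.5\n",
    "r2\tsubjB\t60.0\t30\t12\t0\t1\t90\t5\t34\t2e-3\t30.1\n",
]
print(filter_hits(hits))
```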
(C) Analyzing by MEGAN
[Figure: tsv.1, tsv.2, …, tsv.n → MEGAN blast2rma (on n nodes) → rma.1, rma.2, …, rma.n]
• The MEGAN blast2rma command is run (using CPUs only)
• The computation is performed independently for each read's search result, so the contents of the rma files are not affected by dividing the query
(D) Integrating Results
[Figure: rma.1, rma.2, …, rma.n → MEGAN compute-comparison → MEGAN extract-biome → Results (rma file)]
• After all MEGAN blast2rma processes finish, the MEGAN compute-comparison command is run
– It integrates multiple analysis results into a single file
• Then MEGAN extract-biome is used to summarize the overall results
GHOSTMEGAN Pipeline on Cluster System
To ensure usability, only one parameter file needs to be edited.
Experimental Settings
Hardware Specification
• TSUBAME 3.0
– 25th-ranked supercomputer (Top500, 8.1 Petaflops, Jun 2019)
– 15,120 CPU cores
– 2,160 NVIDIA P100 GPUs
TSUBAME 3.0 compute node specification (f_node)
CPU: Intel Xeon E5-2680 v4 (2.4 GHz, 14 cores) × 2
GPU: NVIDIA Tesla P100 NVLink (16 GB) × 4
RAM: 256 GiB
Local storage: Intel SSD DC P3500 (2 TB)
Network: Intel Omni-Path 100 Gb/s × 4
Job scheduler: Univa Grid Engine 8.5.4C104 11
We ran GHOSTMEGAN on n nodes in parallel, with the query divided into n files, for n = 1, 2, 4, 8, 16, 32, 64, and 128, and compared the execution times and speedup rates.
Software and Dataset
Software:
• Homology search: GHOSTZ-GPU ver. 1.1.0 (https://github.com/akiyamalab/ghostz-gpu)
$ ghostz-gpu aln -d [DB] -b 1 -q d -a 1 -g 3 -i [query]
• Post process: MEGAN ver. 6.12.6 (http://megan.informatik.uni-tuebingen.de)
$ blast2rma --in [GHOSTZ output] --out [MEGAN rma file] --format BlastTab
Dataset:
• Query sequences: human oral WGS metagenome reads
– Duran-Pinedo AE, et al. ISME J, 8(8), 1659–1672, 2014.
– The query used a random sample of 1,000,000 reads (100 bp) from periodontally healthy individual samples (145 MB)
• Database: NCBI nr
– 166,109,435 seqs (101 GB)
– ftp://ftp.ncbi.nih.gov/blast/db/ (accessed August 18, 2018)
Results and Discussion
(1) Overall Pipeline Execution Time
[Figure: overall pipeline execution time versus number of nodes; labeled values: 15 hours, 33 min, 24 min, 20 min]
• The maximum acceleration was ~45 times (on 128 nodes)
• GHOSTZ-GPU was so fast that its calculation time saturated at large node counts
(2) Parallel Efficiency (Scalability)
strong scaling = (speedup by n nodes against 1 node) / n
[Figure: strong scaling versus number of nodes; labeled values: 0.98, 0.93, 0.87, 0.60, 0.35]
• Linear speed improvement was obtained between 1 and 32 nodes (strong scaling = 0.87)
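The strong scaling definition can be checked against the reported numbers with a short Python sketch. The ~45-times speedup on 128 nodes comes from the slides; reading the "15 hours" and "20 min" labels as the 1-node and 128-node run times is our assumption, and the function names are hypothetical.

```python
def speedup(time_one_node: float, time_n_nodes: float) -> float:
    """Speedup of an n-node run against the 1-node run (same time unit for both)."""
    return time_one_node / time_n_nodes


def strong_scaling(speedup_value: float, n_nodes: int) -> float:
    """Parallel efficiency: speedup by n nodes against 1 node, divided by n."""
    return speedup_value / n_nodes


# Reported: about 45-times acceleration on 128 nodes.
print(round(strong_scaling(45, 128), 2))

# Assumption: 15 hours (900 min) on 1 node and 20 min on 128 nodes.
print(speedup(15 * 60, 20))
```

A value of 1.0 would mean perfectly linear scaling, so the 128-node value shows that most of the added nodes no longer shorten the run for a query of this size.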
Summary of the Results
• MEGAN scaling was good
• GHOSTZ-GPU scaling decreased at n > 32
– The query data was small
– Higher efficiency is expected for larger queries
• A larger query was difficult to use this time because the n = 1 run had to be measured
• For a larger query, strong scaling could be measured against n = 8, for example
• MEGAN, which has no GPU implementation, has room for acceleration
– To cope with increasing query sizes, steps other than the homology search also need GPU acceleration
Homology Search Results
We found only two reads with different homology search results out of 1 million reads in the parallel computing of GHOSTMEGAN for the dataset.
[Table: rows read a, read b, others; columns compute on 1 node, compute on 2 nodes]
read a
– XP_025968818.1 LOW QUALITY PROTEIN: tigger transposable element-derived protein 1-like [Dromaius novaehollandiae]
– XP_019376199.1 PREDICTED: tigger transposable element-derived protein 1-like, partial [Gavialis gangeticus]
✓ The "tigger transposable element-derived protein 1-like" gene is widely conserved
✓ The result did not affect the WGS metagenome analysis
read b
– BAD18412.1 unnamed protein product [Homo sapiens]
– EHH57573.1 hypothetical protein EGM_07242, partial [Macaca fascicularis]
✓ Both were annotated as function-unknown genes
✓ The result also did not affect the metagenome analysis at this time
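A comparison of this kind can be sketched by collecting the top hit per read from two tab-delimited result sets and listing the reads where they differ. This is an illustration under assumptions, not the authors' procedure: it assumes the read ID and subject ID are the first two tab-separated fields, that the first line seen for a read is its top hit, and the function names and toy IDs are hypothetical.

```python
from typing import Dict, Iterable, List


def top_hits(lines: Iterable[str]) -> Dict[str, str]:
    """Map each read ID to the subject ID of its first listed hit."""
    best: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        best.setdefault(fields[0], fields[1])  # keep only the first hit per read
    return best


def differing_reads(run_a: Dict[str, str], run_b: Dict[str, str]) -> List[str]:
    """Return read IDs whose top hit differs, or that appear in only one run."""
    all_reads = run_a.keys() | run_b.keys()
    return sorted(r for r in all_reads if run_a.get(r) != run_b.get(r))


# Toy results (made-up IDs): one read changes its top hit between runs.
one_node = ["r1\tsubjA\n", "r2\tsubjB\n", "r3\tsubjC\n"]
two_nodes = ["r1\tsubjA\n", "r2\tsubjX\n", "r3\tsubjC\n"]
print(differing_reads(top_hits(one_node), top_hits(two_nodes)))
```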
Conclusion
• The GHOSTMEGAN pipeline was developed and evaluated to achieve large-scale metagenomic analysis
– Homology search and other processes were parallelized
– Executed on the TSUBAME 3.0 supercomputer with multiple GPUs
• GHOSTMEGAN achieved parallel computing on multiple compute nodes
– Obtained linear speedup up to 32 nodes
– 45-times faster calculation on 128 nodes
• GPU-accelerated MEGAN or other tools will be crucial
– GHOSTZ-GPU was significantly accelerated on multiple GPUs
– To prepare for further increases in data size in the future
Acknowledgments
Funding
Akiyama Lab. Tokyo Tech, JAPAN

More Related Content

What's hot

What's hot (20)

20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
 
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC PlatformsProtecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
 
PFQ@ 10th Italian Networking Workshop (Bormio)
PFQ@ 10th Italian Networking Workshop (Bormio)PFQ@ 10th Italian Networking Workshop (Bormio)
PFQ@ 10th Italian Networking Workshop (Bormio)
 
PG-Strom
PG-StromPG-Strom
PG-Strom
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @Cloudflare
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
Prashant de-ny-project-s1
Prashant de-ny-project-s1Prashant de-ny-project-s1
Prashant de-ny-project-s1
 
Nvidia in bioinformatics
Nvidia in bioinformaticsNvidia in bioinformatics
Nvidia in bioinformatics
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
 
Static Analysis and Code Optimizations in Glasgow Haskell Compiler
Static Analysis and Code Optimizations in Glasgow Haskell CompilerStatic Analysis and Code Optimizations in Glasgow Haskell Compiler
Static Analysis and Code Optimizations in Glasgow Haskell Compiler
 
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorganShared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
 
Gnocchi v3
Gnocchi v3Gnocchi v3
Gnocchi v3
 
Bioinformatics on GPU
Bioinformatics on GPUBioinformatics on GPU
Bioinformatics on GPU
 
Interactive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science CloudInteractive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science Cloud
 

Similar to Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN

BC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan PresentationBC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan Presentation
Elijah Willie
 
20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis
Yi-Feng Chang
 

Similar to Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN (20)

Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
 
Paper - Muhammad Gulraj
Paper - Muhammad GulrajPaper - Muhammad Gulraj
Paper - Muhammad Gulraj
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
BWA-MEM2-IPDPS 2019
BWA-MEM2-IPDPS 2019BWA-MEM2-IPDPS 2019
BWA-MEM2-IPDPS 2019
 
Cram 3.1 / Crumble
Cram 3.1 / CrumbleCram 3.1 / Crumble
Cram 3.1 / Crumble
 
Parallel Biological Sequence Comparison in GPU Platforms
Parallel Biological Sequence Comparison in GPU PlatformsParallel Biological Sequence Comparison in GPU Platforms
Parallel Biological Sequence Comparison in GPU Platforms
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributor
 
Xomics brochure short version
Xomics brochure short versionXomics brochure short version
Xomics brochure short version
 
CNS_poster12
CNS_poster12CNS_poster12
CNS_poster12
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
BC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan PresentationBC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan Presentation
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processor
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
Artificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance TuningArtificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance Tuning
 
20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 

More from Masahito Ohue

More from Masahito Ohue (20)

学振特別研究員になるために~2024年度申請版
 学振特別研究員になるために~2024年度申請版 学振特別研究員になるために~2024年度申請版
学振特別研究員になるために~2024年度申請版
 
学振特別研究員になるために~2023年度申請版
学振特別研究員になるために~2023年度申請版学振特別研究員になるために~2023年度申請版
学振特別研究員になるために~2023年度申請版
 
学振特別研究員になるために~2022年度申請版
学振特別研究員になるために~2022年度申請版学振特別研究員になるために~2022年度申請版
学振特別研究員になるために~2022年度申請版
 
第43回分子生物学会年会フォーラム2F-11「インシリコ創薬を支える最先端情報科学」から抜粋したAlphaFold2の話
第43回分子生物学会年会フォーラム2F-11「インシリコ創薬を支える最先端情報科学」から抜粋したAlphaFold2の話第43回分子生物学会年会フォーラム2F-11「インシリコ創薬を支える最先端情報科学」から抜粋したAlphaFold2の話
第43回分子生物学会年会フォーラム2F-11「インシリコ創薬を支える最先端情報科学」から抜粋したAlphaFold2の話
 
Learning-to-rank for ligand-based virtual screening
Learning-to-rank for ligand-based virtual screeningLearning-to-rank for ligand-based virtual screening
Learning-to-rank for ligand-based virtual screening
 
Molecular Activity Prediction Using Graph Convolutional Deep Neural Network C...
Molecular Activity Prediction Using Graph Convolutional Deep Neural Network C...Molecular Activity Prediction Using Graph Convolutional Deep Neural Network C...
Molecular Activity Prediction Using Graph Convolutional Deep Neural Network C...
 
学振特別研究員になるために~2020年度申請版
学振特別研究員になるために~2020年度申請版学振特別研究員になるために~2020年度申請版
学振特別研究員になるために~2020年度申請版
 
出会い系タンパク質を探す旅
出会い系タンパク質を探す旅出会い系タンパク質を探す旅
出会い系タンパク質を探す旅
 
学振特別研究員になるために~2019年度申請版
学振特別研究員になるために~2019年度申請版学振特別研究員になるために~2019年度申請版
学振特別研究員になるために~2019年度申請版
 
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
 
目バーチャルスクリーニング
目バーチャルスクリーニング目バーチャルスクリーニング
目バーチャルスクリーニング
 
Microsoft Azure上でのタンパク質間相互作用予測システムの並列計算と性能評価
Microsoft Azure上でのタンパク質間相互作用予測システムの並列計算と性能評価Microsoft Azure上でのタンパク質間相互作用予測システムの並列計算と性能評価
Microsoft Azure上でのタンパク質間相互作用予測システムの並列計算と性能評価
 
学振特別研究員になるために~2018年度申請版
学振特別研究員になるために~2018年度申請版学振特別研究員になるために~2018年度申請版
学振特別研究員になるために~2018年度申請版
 
計算で明らかにするタンパク質の出会いとネットワーク(FIT2016 助教が吼えるセッション)
計算で明らかにするタンパク質の出会いとネットワーク(FIT2016 助教が吼えるセッション)計算で明らかにするタンパク質の出会いとネットワーク(FIT2016 助教が吼えるセッション)
計算で明らかにするタンパク質の出会いとネットワーク(FIT2016 助教が吼えるセッション)
 
Finding correct protein–protein docking models using ProQDock (ISMB2016読み会, 大上)
Finding correct protein–protein docking models using ProQDock (ISMB2016読み会, 大上)Finding correct protein–protein docking models using ProQDock (ISMB2016読み会, 大上)
Finding correct protein–protein docking models using ProQDock (ISMB2016読み会, 大上)
 
学振特別研究員になるために~知っておくべき10のTips~[平成29年度申請版]
学振特別研究員になるために~知っておくべき10のTips~[平成29年度申請版]学振特別研究員になるために~知っておくべき10のTips~[平成29年度申請版]
学振特別研究員になるために~知っておくべき10のTips~[平成29年度申請版]
 
ISMB/ECCB2015読み会:大上
ISMB/ECCB2015読み会:大上ISMB/ECCB2015読み会:大上
ISMB/ECCB2015読み会:大上
 
学振特別研究員になるために~知っておくべき10のTips~[平成28年度申請版]
学振特別研究員になるために~知っておくべき10のTips~[平成28年度申請版]学振特別研究員になるために~知っておくべき10のTips~[平成28年度申請版]
学振特別研究員になるために~知っておくべき10のTips~[平成28年度申請版]
 
IIBMP2014 Lightning Talk - MEGADOCK 4.0
IIBMP2014 Lightning Talk - MEGADOCK 4.0IIBMP2014 Lightning Talk - MEGADOCK 4.0
IIBMP2014 Lightning Talk - MEGADOCK 4.0
 
PrePPI: structure-based protein-protein interaction prediction
PrePPI: structure-based protein-protein interaction predictionPrePPI: structure-based protein-protein interaction prediction
PrePPI: structure-based protein-protein interaction prediction
 

Recently uploaded

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 

Recently uploaded (20)

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...

Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN

  • 1. Parallelized Pipeline for Whole Genome Shotgun Metagenomics with GHOSTZ-GPU and MEGAN
      • (DAY-3) Oct 29, 2019, B4 - Bioinformatics Session 4 (Sequence), Royal Olympic Hotel, Athens, GREECE
      • Masahito Ohue (1), Marina Yamasawa (1, 2), Kazuki Izawa (1), Yutaka Akiyama (1)
          – (1) Department of Computer Science, School of Computing, Tokyo Institute of Technology, JAPAN
          – (2) Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), JAPAN
      • Paper-ID 228
  • 2. Agenda
      • Introduction
          – Metagenome
          – 16S rRNA vs. whole genome shotgun (WGS) metagenomics
          – Homology search, GHOSTZ-GPU
          – WGS metagenome workflow
      • GHOSTMEGAN Pipeline
      • Computational Experiments
      • Results and Discussion
      • Conclusion
  • 4. Metagenome Analysis
      • Directly sequencing uncultured microbiomes obtained from a target environment and analyzing the sequence data
          – Finding novel genes from unculturable microorganisms
          – Elucidating the composition of species/genes in environments
      • Examples of microbiomes: human body, sea, gut, soil, oral
  • 5. Home Microbiome Study; Hospital Microbiome Project; Earth Microbiome Project; Marine Phage Sequencing Project; National Metagenomic Project
  • 6. 16S rRNA Metagenomics vs. WGS Metagenomics
      • 16S rRNA Sequencing: analyzes DNA from amplicon sequencing of prokaryotic 16S small subunit ribosomal RNA genes.
          ✓ Provides visuals of taxonomic classification
          ✓ Low cost
          × Cannot search for functional genes
      • Whole Genome Shotgun (WGS) Sequencing: analyzes the untargeted ('shotgun') sequencing of all ('meta-') microbial genomes present in a sample.
          ✓ Provides visuals of taxonomic classification and functional genes
          × More costly
  • 7. [Figure: (16S) Taxonomic Composition (Periodontal diseases); percentage composition (0% to 100%) for samples H1 to H10 and P1 to P6] (Izawa K, et al. unpublished work)
  • 8. [Figure: (WGS) Functional Gene Category Composition; percentage composition (0% to 100%) for samples H1 to H10 and P1 to P6] (Izawa K, et al. unpublished work)
  • 9. 16S Analysis Workflow: (example) USEARCH mapping + QIIME summarization. Baichoo S, et al. BMC Bioinformatics, 19(1):457, 2018.
  • 10. WGS Metagenome Analysis Flow: (example) homology search + summarization
      • Query: AATCG, GCACA
      • Database: Escherichia coli, Daphnia pulex (ATGCGAAATCGCTA…, CGGCTCAGCGATCG…)
      • Smith-Waterman? BLAST? Too slow!
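  • 11. Rough Comparison of Homology Search Tools
      • BLAST (Altschul 1990): sensitivity ✓ best; speed ratio (1); GPU △
      • BLAT (Kent 2002): sensitivity ×; speed ratio 50; GPU ×
      • RAPSearch ver. 2.12 (Ye 2011, Zhao 2012): sensitivity ✓, △ (fast); speed ratio 100, 1,600 (fast); GPU ×
      • DIAMOND ver. 0.7.9 (Buchfink 2015): sensitivity △, × (fast); speed ratio 1,000, 3,000 (fast); GPU ×
      • GHOSTZ (Suzuki 2015): sensitivity ✓; speed ratio 400; GPU ×
      • GHOSTZ-GPU (Suzuki 2016): sensitivity ✓; speed ratio 1,500 (1 GPU), 2,000 (2 GPUs), 2,500 (3 GPUs); GPU ✓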
  • 12. GHOSTZ Algorithm
      • BLAST: query sequences → K-mer (neighborhood words) → finite automaton → seed search against the database → gapless extension → gapped extension → results. K-mer substring matches are searched by using a finite automaton.
      • GHOSTZ: database → subsequence clustering → hash table; query sequences → hash table → seed search → distance calculation using cluster representatives → gapless extension → gapped extension → results.
      • Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics, 31(8), 1183–1190, 2015.
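The hash-table seed search named on this slide can be illustrated with a minimal sketch. It shows only exact k-mer lookup. It leaves out GHOSTZ's subsequence clustering, the distance calculation against cluster representatives, and both extension steps, and it is not taken from the GHOSTZ source. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer in the database to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Return (query position, sequence name, database position) for each exact k-mer match."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    # Toy data (hypothetical names); a real search would go on to extend each seed.
    db = {"seq1": "ATGCGAAATCGCTA", "seq2": "CGGCTCAGCGATCG"}
    print(find_seeds("AATCG", build_kmer_index(db, 5), 5))
```

Building the index once and probing it for each query k-mer is what makes seed search cheap compared with aligning every query against every database sequence.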
  • 13. Homology search accuracy (sensitivity), marine sample
      • [Figure: sensitivity of RAPSearch and GHOSTZ/GHOSTZ-GPU for ERR315856 (Marine Microbiome, Tara Oceans) against KEGG GENES DB]
      • Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
  • 14. GHOSTZ/GHOSTZ-GPU Calculation Speed
      • [Figure: bar chart of computation time (sec.); bar values: 41,236; 2,644; 9,970; 2,794; 1,885; 1,502; 3,717; 1,034]
      • SRR407548 (Soil) + SRS011098 (Oral) + ERR315856 (Marine) against KEGG GENES DB; 1,000,000 randomly selected DNA reads from each dataset.
      • CPU: 12 CPU threads, Xeon5670, 2.93 GHz; GPU: Tesla K20X
      • Suzuki S, Kakuta M, Ishida T, Akiyama Y. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering. PLoS ONE 11(8): e0157338, 2016.
  • 15. Summarization Tool
      • Calculates the relative ratios of OTUs and gene functions using the output of BLAST and GHOSTZ-GPU
      • * MEGAN itself is a pipeline tool (based on DIAMOND)
      • Huson DH, et al. PLoS Comput Biol. (2016)
  • 16. WGS Metagenomics Pipeline
      • MetaWRAP
          – Does not handle homology searches (preferably performs metagenome assembly)
          – Does not support multi-node parallelization
      • MEGAN
          – Uses DIAMOND
          – DIAMOND does not support GPU acceleration
          – Thus a high-speed analysis is not possible
      • MiGAP
          – Uses BLAST
          – Thus also cannot perform high-speed analysis
      • Uritskiy GV, et al. Microbiome, 6(1), 158, 2018. Huson DH, et al. PLoS Comput Biol, 12: e1004957, 2016. Sugawara H, et al. Genome Inform, 2009.
  • 17. WGS Metagenome Analysis Flow
      • Sequencer output: 20 billion reads per 2 days (Illumina NovaSeq 6000)
      • e.g. analysis of 100 million reads (150 bp) (Database: KEGG GENES DB, 1.3 million seqs)
          – BLAST: 2,500 days (on a normal laptop PC)
          – 18 hrs (on a 28-core workstation)
          – 6 hrs (on a workstation with 28 cores & 4 GPUs)
      • Further speedup is needed! ▶ multi-node parallel computing
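A rough back-of-the-envelope calculation shows why the slide asks for further speedup. It uses only the figures quoted above and assumes, as a simplification not stated on the slide, that analysis time grows linearly with the number of reads.

```python
# Figures quoted on the slide.
READS_PER_RUN = 20_000_000_000   # sequencer output per run
RUN_DAYS = 2                     # duration of one sequencing run
BENCH_READS = 100_000_000        # reads in the benchmark analysis
BENCH_HOURS = {"28 cores + 4 GPUs": 6, "28 cores": 18}


def days_to_analyze(reads, bench_hours):
    """Days needed for `reads`, ASSUMING time scales linearly with read count."""
    return reads / BENCH_READS * bench_hours / 24


if __name__ == "__main__":
    for label, hours in BENCH_HOURS.items():
        days = days_to_analyze(READS_PER_RUN, hours)
        print(f"{label}: {days:.0f} days per run, "
              f"{days / RUN_DAYS:.0f}x longer than the sequencing itself")
```

Under that assumption, a single workstation with 4 GPUs would need about 50 days to process what the sequencer produces in 2 days, which motivates spreading the work over many nodes.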
  • 18. Purpose of This Study
      • Developing a new WGS metagenome analysis system, GHOSTMEGAN
          – Pipeline the sequence homology search and post-processing
          – Perform distributed computation on parallel computers
      • Linking GHOSTZ-GPU and MEGAN
          – GHOSTZ-GPU is the fastest sequence homology search tool that supports multi-GPU computation
      • Performance evaluation
          – Evaluate using an actual WGS metagenome dataset by parallel execution on a multi-node GPU cluster
  • 20. Overview
      • Simple workflow
      • Focused on the cluster machine (multi-GPU × multi-node supercomputer)
  • 21. GHOSTMEGAN Pipeline on Cluster System
      • (A) Dividing query: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n
      • (B) Sequence homology search by GHOSTZ-GPU: fasta.1, …, fasta.n → GHOSTZ-GPU → tsv.1, …, tsv.n
      • (C) Analyzing by MEGAN: tsv.1, …, tsv.n → MEGAN → rma.1, …, rma.n
      • (D) Integrating results: rma.1, …, rma.n → Concat rma → Results (rma file)
      • Each divided file is processed through (B) and (C) on a single node
  • 22. (A) Dividing Query: Query (fasta file) → Divide fasta → fasta.1, fasta.2, …, fasta.n (n nodes)
      • The input file (query) for WGS metagenome analysis is a single huge fasta file
      • The query file is divided into n files for n compute nodes
      • The processing time for dividing queries is extremely small compared with the other steps
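Step (A) can be sketched as follows. The slides do not say how GHOSTMEGAN assigns reads to the n files, so the round-robin distribution here is an assumption, and the function names are hypothetical. The output names follow the fasta.1 … fasta.n labels in the pipeline diagram.

```python
from contextlib import ExitStack
from pathlib import Path


def read_fasta(path):
    """Yield (header line, sequence) records from a FASTA file, one at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n):
    """Write records round-robin into <path>.1 ... <path>.n and return the output paths.

    ASSUMPTION: round-robin assignment; the actual pipeline may divide differently.
    """
    out_paths = [Path(f"{path}.{i}") for i in range(1, n + 1)]
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, "w")) for p in out_paths]
        for index, (header, seq) in enumerate(read_fasta(path)):
            handles[index % n].write(f"{header}\n{seq}\n")
    return out_paths
```

Streaming the records rather than loading the whole file keeps memory use flat for a very large query, and round-robin assignment gives each node nearly the same number of reads.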
  • 23. (B) Sequence Homology Search by GHOSTZ-GPU: fasta.1, fasta.2, …, fasta.n → GHOSTZ-GPU → tsv.1, tsv.2, …, tsv.n (n nodes)
      • GHOSTZ-GPU is executed on a node for each divided query file
          – Thread-parallel computation using all CPU/GPU resources
      • The genome DB is stored in the local storage of every node
      • The output of GHOSTZ-GPU is a tab-delimited BLAST format file
          – Results with E-value < 10⁻⁵ are provided to the next step
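The E-value cutoff in step (B) can be sketched as a simple filter over the tab-delimited output. This assumes the standard 12-column BLAST tabular layout, in which the E-value is the 11th column. The slides do not state where in the pipeline the cutoff is applied, and the function name is hypothetical.

```python
import csv

# ASSUMPTION: standard BLAST tabular layout, E-value in the 11th column (index 10).
EVALUE_COLUMN = 10


def filter_by_evalue(lines, threshold=1e-5):
    """Yield the rows whose E-value is strictly below `threshold`."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[EVALUE_COLUMN]) < threshold:
            yield row


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    sample = [
        "read1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-30\t180",
        "read2\tsubjB\t40.0\t30\t18\t0\t1\t30\t5\t34\t0.01\t25",
    ]
    for hit in filter_by_evalue(sample):
        print(hit[0], hit[1], hit[EVALUE_COLUMN])
```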
  • 24. (C) Analyzing by MEGAN: tsv.1, tsv.2, …, tsv.n → MEGAN blast2rma → rma.1, rma.2, …, rma.n (n nodes)
      • The MEGAN blast2rma command is run (using CPUs only)
      • The computation is performed independently for each read's search result in the rma file, so it is not affected by dividing the queries
  • 25. (D) Integrating Results: rma.1, rma.2, …, rma.n → MEGAN compute-comparison → MEGAN extract-biome → Results (rma file)
      • After all MEGAN blast2rma processes have finished, the MEGAN compute-comparison command is run
          – It integrates multiple analysis results into a single file
      • Then MEGAN extract-biome is used to summarize the overall results
  • 26. GHOSTMEGAN Pipeline on Cluster System: to ensure usability, only one parameter file needs to be edited
  • 28. Hardware Specification
      • TSUBAME 3.0
          – 25th-ranked supercomputer (Top500, 8.1 Petaflops, Jun 2019)
          – 15,120 CPU cores
          – 2,160 NVIDIA P100 GPUs
      • TSUBAME 3.0 compute node specification (f_node)
          – CPU: Intel Xeon E5-2680 v4 (2.4 GHz, 14 cores) × 2
          – GPU: NVIDIA Tesla P100 NVLink (16 GB) × 4
          – RAM: 256 GiB
          – Local storage: Intel SSD DC P3500 (2 TB)
          – Network: Intel Omni-Path 100 Gb/s × 4
          – Job scheduler: Univa Grid Engine 8.5.4C104 11
      • We ran GHOSTMEGAN on n nodes in parallel, using n = 1, 2, 4, 8, 16, 32, 64, and 128 as the query division number, and compared the execution times and speedup rates.
  • 29. Software and Dataset
      • Homology search: GHOSTZ-GPU ver. 1.1.0 (https://github.com/akiyamalab/ghostz-gpu)
          $ ghostz-gpu aln -d [DB] -b 1 -q d -a 1 -g 3 -i [query]
      • Post process: MEGAN ver. 6.12.6 (http://megan.informatik.uni-tuebingen.de)
          $ blast2rma --in [GHOSTZ output] --out [MEGAN rma file] --format BlastTab
      • Dataset
          – Query sequences: human oral WGS metagenome reads (Duran-Pinedo AE, et al. ISME J, 8(8), 1659–1672, 2014.)
          – The query was a random sample of 1,000,000 reads (100 bp) from periodontally healthy individual samples (145 MB)
          – Database: NCBI nr, 166,109,435 seqs (101 GB), ftp://ftp.ncbi.nih.gov/blast/db/ (accessed August 18, 2018)
  • 31. (1) Overall Pipeline Execution Time
      • [Figure: overall pipeline execution time; labeled values: 15 hours, 20 min, 24 min, 33 min]
      • The maximum acceleration was ~45-fold (on 128 nodes)
      • GHOSTZ-GPU was so fast that the calculation time saturated
  • 32. (2) Parallel Efficiency (Scalability)
      • strong scaling = (speedup by n nodes against 1 node) / n
      • [Figure: labeled strong scaling values: 0.87, 0.60, 0.35, 0.93, 0.98]
      • Linear speed improvement was obtained from 1 to 32 nodes (strong scaling = 0.87)
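The strong scaling definition above can be checked against the rounded timings shown on the execution-time slide. Pairing 33 min, 24 min and 20 min with 32, 64 and 128 nodes is an assumption inferred from the reported values; only the ~45-fold figure for 128 nodes is stated explicitly. The function name is hypothetical.

```python
def strong_scaling(t1, tn, n):
    """Parallel efficiency: (speedup on n nodes relative to 1 node) divided by n."""
    return (t1 / tn) / n


# Rounded timings from the slides, in minutes.
T1_MIN = 15 * 60                     # 15 hours on 1 node
# ASSUMPTION: these node counts match these timings (inferred, not stated).
TIMINGS_MIN = {32: 33, 64: 24, 128: 20}

if __name__ == "__main__":
    for n, tn in TIMINGS_MIN.items():
        speedup = T1_MIN / tn
        print(f"n={n}: speedup {speedup:.1f}, "
              f"strong scaling {strong_scaling(T1_MIN, tn, n):.2f}")
```

With these rounded inputs the efficiencies come out at about 0.85, 0.59 and 0.35, close to the 0.87, 0.60 and 0.35 reported on the slide. The small differences are consistent with the slide values being computed from unrounded times.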
  • 33. Summary of the Results
      • MEGAN scaling was good
      • GHOSTZ-GPU scaling decreased at n > 32
          – The query data was small
          – High efficiency is expected for larger queries
              · A larger query was difficult to use this time because n = 1 had to be measured
              · For a larger query, strong scaling against n = 8 can be measured, for example
      • MEGAN, which has no GPU implementation, has room for acceleration
          – To cope with growing query sizes, steps other than the homology search also need to be accelerated with GPUs
  • 34. Homology Search Results
      • We found only two reads with different homology search results out of 1 million reads in the parallel computing of GHOSTMEGAN for the dataset
      • read a
          – compute on 2 nodes: XP_025968818.1 LOW QUALITY PROTEIN: tigger transposable element-derived protein 1-like [Dromaius novaehollandiae]
          – others: XP_019376199.1 PREDICTED: tigger transposable element-derived protein 1-like, partial [Gavialis gangeticus]
          ✓ The “tigger transposable element-derived protein 1-like” gene is widely conserved
          ✓ The result did not affect the WGS metagenome analysis
      • read b
          – compute on 1 node: BAD18412.1 unnamed protein product [Homo sapiens]
          – others: EHH57573.1 hypothetical protein EGM_07242, partial [Macaca fascicularis]
          ✓ Both were annotated as function-unknown genes
          ✓ The result also did not affect the metagenome analysis at this time
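A comparison like the one on this slide could be made by collecting the top hit per read from two runs and listing the reads where they disagree. This is a minimal sketch and not the authors' procedure. It assumes BLAST tabular output with the query ID in column 1 and the subject ID in column 2, and it takes the first listed hit for each read as its top hit. The function names are hypothetical.

```python
import csv


def best_hits(lines):
    """Map each query ID to the subject ID of its first listed hit.

    ASSUMPTION: BLAST tabular rows, query ID in column 1 and subject ID in column 2.
    """
    hits = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row and row[0] not in hits:
            hits[row[0]] = row[1]
    return hits


def differing_reads(lines_a, lines_b):
    """Return the sorted read IDs whose top hit differs, or is missing, between two runs."""
    a, b = best_hits(lines_a), best_hits(lines_b)
    return sorted(q for q in a.keys() | b.keys() if a.get(q) != b.get(q))
```

Passing the concatenated tsv files of a 1-node run and an n-node run (as open file objects or lists of lines) would return the reads that need manual inspection.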
  • 36. Conclusion
      • The GHOSTMEGAN pipeline was developed and evaluated to achieve large-scale metagenomic analysis
          – Homology search and other processes were parallelized
          – Executed on the TSUBAME 3.0 supercomputer with multiple GPUs
      • GHOSTMEGAN achieved parallel computing on multiple compute nodes
          – Obtained linear speedup up to 32 nodes
          – 45 times faster calculation on 128 nodes
      • GPU-accelerated MEGAN or other tools will be crucial
          – GHOSTZ-GPU was significantly accelerated on multiple GPUs
          – To prepare for further increases in data size in the future