Benchmarking 16S rRNA gene sequencing and bioinformatics tools for identification of microbial abundances
Acknowledgments
The authors acknowledge CRG Genomics Core Facility for their sequencing services, CRG Bioinformatics Core Facility and
UCT ICTS High Performance Computing team for their computing facilities. The project was financed by CRG through
Genomics and Bioinformatics Core Facilities funds as part of the “Saca la Lengua” project, which is an initiative of and the
“la Caixa” Foundation, with the participation of the Center for Research into Environmental Epidemiology (CREAL), and the
“Center d’Excellència Severo Ochoa 2013-2017” programme (SEV-2012-02-08) of the Ministry of Economy and
Competitiveness. David Harris Onywera received a grant from the CRG-Novartis-Africa Mobility Programme.
1Bioinformatics Core Facility, Centre for Genomic Regulation (CRG), Dr. Aiguader 88, Barcelona, Spain; 2Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; 3Institute of Infectious Disease and Molecular
Medicine (IDM), University of Cape Town (UCT), Anzio Road, Observatory 7925, Cape Town, South Africa
Introduction
High-throughput DNA sequencing continues to offer comprehensive insights into microbial ecosystems [1].
Several bioinformatics tools have been benchmarked, but inconclusively [2], and variations in algorithms are known to
affect microbiome results [3]. Thus, there is a need for detailed benchmarking of bioinformatics tools. Here
we validated 16S rRNA amplicon sequencing and four bioinformatics tools for microbiome analyses.
Methods
Genomic DNA from two microbial mock communities (Even: HM782D, Staggered: HM783D, BEI Resources)
was sequenced by shotgun and V3-V4 16S rRNA sequencing on Illumina HiSeq and MiSeq, respectively.
For 16S rRNA and whole DNA, eight and three independent sequencing runs were performed, respectively.
All reads were mapped to a database of 20 reference bacterial genomes using Bowtie2 [4].
Four bioinformatics tools for 16S rRNA analysis were set up and tested: mothur [5], QIIME [6], QUPARSE (UPARSE [7] imported into
QIIME [6]) and riboPicker (based on the skewer [8], pear [9] and ribopicker [10] algorithms).
Taxonomic annotations on globally trimmed non-chimeric representative sequences in QIIME, mothur, and
riboPicker were performed by the RDP Classifier using the SILVA database v119 with ≥90% bootstrap
confidence. In QUPARSE, the Greengenes Database (13_8 Release) was used.
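A minimal sketch of how a bootstrap-confidence cutoff might be applied before computing genus-level relative abundances. It assumes the classifier output has already been parsed into (genus, confidence) pairs with confidence expressed as a fraction between 0 and 1; the function name, the input layout and the "unclassified" label are illustrative assumptions, not part of any of the tested tools.

```python
from collections import Counter


def genus_relative_abundance(assignments, min_confidence=0.90):
    """Return genus -> relative abundance after a confidence cutoff.

    assignments: iterable of (genus, bootstrap_confidence) pairs, one per
    sequence (hypothetical input layout). Sequences below the cutoff are
    pooled under "unclassified" (an assumption made for this sketch).
    """
    counts = Counter()
    for genus, confidence in assignments:
        label = genus if confidence >= min_confidence else "unclassified"
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up assignments, for illustration only.
    example = [
        ("Bacillus", 0.99),
        ("Bacillus", 0.95),
        ("Listeria", 0.92),
        ("Listeria", 0.50),  # below the 0.90 cutoff
    ]
    print(genus_relative_abundance(example))
```

Pooling low-confidence sequences rather than dropping them keeps the denominator equal to the number of classified sequences, so the fraction of unassigned sequences stays visible in the result.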
Distributions of relative taxa abundances estimated by each tool were compared with the numbers of rRNA
operons provided by BEI Resources and with those obtained from whole genome sequencing (WGS).
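The comparison described above can be sketched as follows: reference rRNA operon counts are normalised to expected proportions, and each observed proportion is expressed as a log2 ratio against its expectation. This is only an illustration under stated assumptions; the function names, the pseudocount and the example numbers are hypothetical and do not reproduce the statistical test used on the poster.

```python
import math


def to_proportions(counts):
    """Normalise a taxon -> count mapping so that the values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {taxon: n / total for taxon, n in counts.items()}


def log2_ratios(observed, expected, pseudocount=1e-6):
    """Return taxon -> log2(observed / expected) for every expected taxon.

    The pseudocount (an assumption of this sketch) avoids log(0) when an
    expected taxon is missing from the observed profile.
    """
    return {
        taxon: math.log2((observed.get(taxon, 0.0) + pseudocount)
                         / (expected[taxon] + pseudocount))
        for taxon in expected
    }


if __name__ == "__main__":
    # Made-up operon counts and observed proportions, for illustration only.
    operon_counts = {"TaxonA": 100, "TaxonB": 300}
    expected = to_proportions(operon_counts)
    observed = {"TaxonA": 0.5, "TaxonB": 0.5}
    for taxon, ratio in log2_ratios(observed, expected).items():
        print(f"{taxon}: log2(observed/expected) = {ratio:+.2f}")
```

A positive ratio indicates a taxon that is over-represented relative to its operon-based expectation, a negative ratio one that is under-represented.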
The performance of the methods was evaluated using the parametric HMP statistical package for R [11].
Conclusion
WGS and 16S approaches gave significantly different species distributions in both mocks.
Genera distributions in the staggered mock by all tools were similar to the 16S rRNA mapping data.
mothur and QUPARSE had similar numbers of false-positive (FP) and false-negative (FN) genera, significantly fewer than riboPicker and
QIIME, at different thresholds on genus abundance in all mocks. FN results are not shown.
QUPARSE left more than half of the sequenced reads unassigned to any genus. Its performance on the even mock was less
satisfactory than that of the other tools.
mothur performed better than the other three bioinformatics tools that were tested.
Luca Cozzuto1,2, Carlos Company1,2, Nuria Andreu Somavilla1,2, Jochen Hecht1,2, David Harris Onywera1,3 and Julia Ponomarenko1,2
Mock bacterial community sequencing and analysis
Results
References
1. Franzosa, E. A. et al. Sequencing and beyond: integrating molecular 'omics' for microbial community profiling. Nat. Rev. Microbiol. 13, 360–372 (2015).
2. Sun, Y. et al. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief. Bioinform. 13, 107–121 (2012).
3. White, J. R. et al. Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics 11, 152 (2010).
4. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
5. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
6. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
7. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat. Methods 10, 996–998 (2013).
8. Jiang, H. et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics 15, 182 (2014).
9. Zhang, J. et al. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620 (2014).
10. Schmieder, R. et al. Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics 28, 433–435 (2012).
11. La Rosa, P. et al. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE 7, e52078 (2012).
Figure 1. Benchmarking metagenomics pipelines using mock communities. Bacterial DNA was extracted, and amplicons were barcoded for
sequencing. The performance of the tools and of the sequencing approaches was evaluated statistically.
luca.cozzuto@crg.eu; carlos.company@crg.eu; harris.onywera@crg.eu; julia.ponomarenko@crg.eu
Species abundances were significantly different between 16S and WGS approaches
Figure 2. Species theoretical and observed abundances. a) Even mock community, b) staggered mock community.
Figure 3. Relative abundances of mock genera. a) Histograms of genera distributions of eight mocks by each tool, b) Bar plots
comparing genera proportions of each tool against one another and 16S mapping data. All but QUPARSE results were similar to 16S
mapping data (QUPARSE: p-value < 0.0004, based on the likelihood-ratio test statistic comparing the Dirichlet parameter vectors).
Distributions by all tools except QUPARSE were not significantly different from 16S mapping data: Even
Distributions by all tools were not significantly different from 16S mapping data: Staggered
Figure 4. Relative abundances of mock genera. a) Histograms of genera distributions of eight mocks by each tool, b) Bar plots
comparing genera proportions of each tool against one another and 16S mapping data. All results were similar.
Significant differences in fraction of assigned reads and false-positively assigned reads
Figure 5. Fraction of all sequenced reads that were assigned. QIIME and
riboPicker assigned >70% of sequenced reads, which was
significantly more than mothur or QUPARSE did.
Figure 6. Proportion of false-positively assigned reads.
Percentage of false-positively assigned reads was low in all
tested methods.
Figure 8. Staggered mock, thresholds of 0.022% and 0.01% abundance.
mothur and QUPARSE had a similar number of positive genera, which was
significantly lower (p-value < 0.001) than QIIME’s or riboPicker’s.
Significant differences in false genera at different thresholds on relative abundances
Figure 7. Even mock. mothur and QUPARSE had a
similar and significantly lower number of false-positive
genera than QIIME and riboPicker (p-value < 0.001).
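The counting of false genera at an abundance threshold, as summarised in Figures 7 and 8, can be sketched roughly as below. Abundances are assumed to be fractions (so 0.01% is 0.0001); the function name, the genus names and all numbers in the example are hypothetical and are not data from this study.

```python
def false_genera(observed, expected_genera, threshold):
    """Return (false_positives, false_negatives) as sets of genus names.

    observed: genus -> relative abundance as a fraction of assigned reads.
    expected_genera: genera known to be present in the mock community.
    threshold: minimum relative abundance for a genus to count as detected.
    """
    detected = {genus for genus, abundance in observed.items()
                if abundance >= threshold}
    expected = set(expected_genera)
    false_positives = detected - expected   # reported but not in the mock
    false_negatives = expected - detected   # in the mock but not reported
    return false_positives, false_negatives


if __name__ == "__main__":
    # Made-up profile, for illustration only.
    observed = {
        "Bacillus": 0.70,
        "Listeria": 0.2998,
        "Klebsiella": 0.00015,
        "Vibrio": 0.00005,
    }
    expected = {"Bacillus", "Listeria", "Staphylococcus"}
    for threshold in (0.0001, 0.00022):  # 0.01% and 0.022%
        fp, fn = false_genera(observed, expected, threshold)
        print(f"threshold={threshold}: FP={sorted(fp)} FN={sorted(fn)}")
```

Raising the threshold can only remove genera from the detected set, so the number of false positives never increases and the number of false negatives never decreases as the threshold grows.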