Introduction
We propose a solution for predictive network biomarker identification on Next Generation Sequencing (NGS) metagenomic datasets, extending machine learning classifiers within a bioinformatics pipeline inspired by the FDA MAQC-II and SEQC studies [1][2].

The whole procedure relies on three main modules: data preprocessing, machine learning profiling, and differential network analysis. We combine a number of well-known open-source software tools with a family of ad hoc solutions.

Here we show an application of our workflow to Inflammatory Bowel Disease (IBD) and dysbiosis, on original high-quality phenotype data from Ospedale Pediatrico Bambino Gesù, Rome.
Pipeline overview
A. Preprocessing
Raw SFF files were preprocessed by Mothur v1.33.3 [3], removing:

1. Sequencing primers and barcodes
2. Reads shorter than 200 bp
3. Homopolymers longer than 8 bp
4. Reads with ambiguous bases
5. Reads with average Phred quality score < 35 over 50 bp moving windows
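
The read-level criteria above (items 2 to 5) can be illustrated with a minimal Python sketch. It is not how Mothur implements these filters: the function names are hypothetical, primer and barcode removal is left out, and a read is simply rejected when any 50 bp window falls below the quality threshold, which is an assumption made for brevity.

```python
import re

MIN_LENGTH = 200          # reads shorter than this are dropped
MAX_HOMOPOLYMER = 8       # longest homopolymer run still accepted
MIN_WINDOW_QUALITY = 35   # minimum mean Phred score in any window
WINDOW_SIZE = 50          # moving window size in bp


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def min_window_quality(quals, window=WINDOW_SIZE):
    """Lowest mean Phred score over all moving windows of the given size."""
    if len(quals) < window:
        return sum(quals) / len(quals) if quals else 0.0
    total = sum(quals[:window])
    lowest = total
    for i in range(window, len(quals)):
        total += quals[i] - quals[i - window]   # slide the window by one base
        lowest = min(lowest, total)
    return lowest / window


def passes_filters(seq, quals):
    """True if the read survives the length, ambiguity, homopolymer and quality checks."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    if re.search(r"[^ACGT]", seq):              # any ambiguous base
        return False
    if longest_homopolymer(seq) > MAX_HOMOPOLYMER:
        return False
    return min_window_quality(quals) >= MIN_WINDOW_QUALITY
```

For instance, `passes_filters("ACGT" * 50, [40] * 200)` accepts a 200 bp read of uniform quality 40, while the same read with a trailing stretch of quality 20 is rejected.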
B. Quantification
QIIME v1.8.0 [4] was used to pick Operational Taxonomic Units (OTUs) from preprocessed reads, following a de novo OTU picking protocol against the Greengenes database 13_8 with the UCLUST algorithm.

1. Sequences with distance-based similarity level 97% or greater were clustered together
2. OTUs failing taxonomic assignment were flagged as Unassigned
3. Seven taxonomic levels (from Kingdom to Species) are available for taxonomic annotation
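
As an illustration of how the Unassigned flag and the seven-level lineage can be used downstream, the following Python sketch drops Unassigned OTUs and sums counts at Genus level. It assumes Greengenes-style lineage strings with rank prefixes (`k__`, `p__`, ..., `s__`) and a plain dictionary as OTU table; the function names are hypothetical, and skipping OTUs without a genus name is a simplification, not the pipeline's documented behaviour.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def parse_lineage(lineage):
    """Map each of the seven ranks to its name (None when unspecified).

    Returns None for OTUs flagged as Unassigned.
    """
    if lineage.strip() == "Unassigned":
        return None
    # "g__Dialister" -> "Dialister"; "s__" -> "" (unspecified level)
    names = [field.strip().split("__", 1)[-1] for field in lineage.split(";")]
    names += [""] * (len(RANKS) - len(names))   # pad truncated lineages
    return {rank: (name or None) for rank, name in zip(RANKS, names)}


def collapse_to_genus(otu_table, taxonomy):
    """Sum per-sample counts of all OTUs sharing a Genus.

    otu_table: {otu_id: [count per sample]}
    taxonomy:  {otu_id: lineage string}
    """
    genus_table = {}
    for otu_id, counts in otu_table.items():
        lineage = parse_lineage(taxonomy[otu_id])
        if lineage is None or lineage["Genus"] is None:
            continue                            # Unassigned or no genus name
        genus = lineage["Genus"]
        previous = genus_table.get(genus, [0] * len(counts))
        genus_table[genus] = [a + b for a, b in zip(previous, counts)]
    return genus_table
```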
C. Predictive Profiling
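
Predictive performance is reported in the Conclusions as MCC (Matthews Correlation Coefficient). As a reminder of what this score measures, here is a minimal Python sketch for a binary task; the function names and the convention of returning 0 when the denominator vanishes are assumptions of this example, not details taken from the pipeline.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient, in [-1, 1]; 0 means chance level."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0   # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

A perfect prediction gives 1, a fully inverted one gives -1, and predicting a single class for every sample gives 0 under the convention above.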
WebValley 2014
A metagenomic pipeline integrating predictive
profiling methods and complex networks
for the analysis of NGS microbiome data
A. Zandonà, M. Chierici, G. Jurman, C. Furlanello,
S. Cucchiara, F. Del Chierico, L. Putignani
Conclusions
IBD status in fecal samples (FEC_H_IBD) was predicted with MCC = 0.73 using 20 features, while IBD status could not be predicted in biopsies (B_H_IBD).

For FEC_B_IBD, OTUs belonging to Clostridiales and Bacteroidales were ranked among the top features. For FEC_H_IBD, the top-ranked features include genera belonging to Rikenellaceae, Barnesiellaceae, Coriobacteriaceae, and Lachnospiraceae. A comparison, via the HIM distance, of the correlation networks built on the co-abundances of the top 30 features for FEC_B_IBD highlighted a link between the Veillonellaceae family and the Dialister genus that is lost between the two networks.

For FEC_H_IBD, network comparison highlighted no conserved links for PCC > 0.6; moreover, an unspecified genus of the phylum Proteobacteria is linked to another genus of the same phylum in healthy subjects, while in IBD patients it is linked to the genus Streptococcus, which belongs to the phylum Firmicutes.
Results (D)
FEC_H_IBD classification task.
Co-abundance networks on top-ranked features.
Gray edges: links conserved between healthy subjects (H, left) and IBD patients (IBD, right).
Green edges: links present in H only.
Red edges: links present in IBD only.
Edge thickness is proportional to the absolute value of the Pearson Correlation Coefficient (PCC). Edges are thresholded at PCC > 0.5.
D. Network Analysis
Based on the netTools R package, ReNette [8] includes methods for differential network analysis, including the HIM (Hamming-Ipsen-Mikhailov) glocal distance.
Starting from the predictive signatures, undirected weighted co-abundance networks were built for each cohort of patient phenotypes, using the top-ranked features as nodes and the (thresholded) absolute Pearson Correlation Coefficient (PCC) as edge weights.
Finally, the structures of the resulting microbiome networks were compared by quantifying network distances with the glocal HIM distance [6,7].
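
The network construction described above can be sketched in a few lines of Python. This is an illustration only: it is not the netTools or ReNette code, the function names are hypothetical, and only the Hamming (local) component of the HIM distance is shown, normalized by the number of node pairs. The spectral Ipsen-Mikhailov component is left out.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0                      # constant feature: no edge
    return cov / (sd_x * sd_y)


def coabundance_network(abundances, threshold=0.5):
    """Build {(node_a, node_b): |PCC|}, keeping only edges with |PCC| > threshold.

    abundances: {feature: [abundance per sample]} for one cohort.
    """
    edges = {}
    for a, b in combinations(sorted(abundances), 2):
        weight = abs(pearson(abundances[a], abundances[b]))
        if weight > threshold:
            edges[(a, b)] = weight
    return edges


def hamming_distance(net_a, net_b, n_nodes):
    """Mean absolute difference of edge weights over all node pairs."""
    pairs = set(net_a) | set(net_b)
    total = sum(abs(net_a.get(p, 0.0) - net_b.get(p, 0.0)) for p in pairs)
    return total / (n_nodes * (n_nodes - 1) / 2)


def edge_classes(net_h, net_ibd):
    """Split edges into conserved, H-only and IBD-only sets."""
    h, ibd = set(net_h), set(net_ibd)
    return h & ibd, h - ibd, ibd - h
```

Both cohorts must be described by the same set of features so that edges can be matched by node pair; `edge_classes` returns the three groups of links that a differential network plot would colour differently.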
Results (C)
M1: OTU table filtered by discarding unassigned OTUs
M2: OTU table filtered by discarding both unassigned OTUs and those with unspecified levels in their taxonomic lineage and for which no siblings have been annotated
G: Genus-level OTU table
S: Canberra stability indicator of the ranked feature list [5]
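
To give an intuition for the stability indicator S, the sketch below computes the mean pairwise Canberra distance between the rank positions of several ranked feature lists, for example lists obtained on different resamplings. It is a simplified stand-in: the full indicator in [5] includes refinements such as normalization that are not reproduced here, and the function names are hypothetical. All lists are assumed to rank the same set of features.

```python
from itertools import combinations


def ranks(ranked_list):
    """Map each feature to its 1-based position in the ranked list."""
    return {feature: position for position, feature in enumerate(ranked_list, start=1)}


def canberra_distance(list_a, list_b):
    """Canberra distance between the rank vectors of two ranked lists."""
    rank_a, rank_b = ranks(list_a), ranks(list_b)
    return sum(
        abs(rank_a[f] - rank_b[f]) / (rank_a[f] + rank_b[f]) for f in rank_a
    )


def mean_pairwise_canberra(ranked_lists):
    """Average Canberra distance over all pairs of lists (lower = more stable)."""
    pairs = list(combinations(ranked_lists, 2))
    return sum(canberra_distance(a, b) for a, b in pairs) / len(pairs)
```

Because each term divides by the sum of the two ranks, disagreements near the top of the lists weigh more than disagreements near the bottom, which suits feature rankings where only the top positions are used.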
A comprehensive assessment of RNA-Seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. [2]
The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models. [1]
Data & Classification
Roche 454 gut microbiome 16S rRNA-Seq measurements:

• 60 fecal samples from 30 healthy and 30 IBD children
• 15 matched normal/inflamed colon tissue biopsies

Classification tasks:
FEC_H_IBD: 27 healthy vs. 30 IBD, fecal content
FEC_H_B_H: 30 fecal samples from healthy subjects vs. 15 healthy tissue biopsies from IBD patients
FEC_B_IBD: 30 fecal samples from IBD patients vs. 15 inflamed tissue biopsies
B_H_IBD: 15 normal vs. 15 inflamed tissue biopsies

Platforms
FBK-KORE cluster: 109 compute nodes, 1120 CPU cores, 8 TB RAM, 200 TB storage for bioinformatics (Nov 2014)

Network Analysis
Inference: Pearson correlation coefficient on the top-ranked features.

Network distance: glocal Hamming-Ipsen-Mikhailov (HIM) distance.

Machine Learning
Three different classifiers:

1. Linear Support Vector Machine (L2L1 and L2L2 penalties)
2. Logistic Regression (L1 penalty)
3. Random Forest

To ensure reproducibility of the results, we adhered to a Data Analysis Protocol (DAP) derived from the FDA MAQC-II and SEQC projects [1, 2].

References
[1] The MAQC Consortium, Nat. Biotechnol., 2010
[2] The SEQC/MAQC-III Consortium, Nat. Biotechnol., 2014
[3] P.D. Schloss et al., Appl. Environ. Microbiol., 2009
[4] J.G. Caporaso et al., Nat. Methods, 2010
[5] G. Jurman et al., Bioinformatics, 2008
[6] G. Jurman et al., arXiv, 2012
[7] M. Filosi et al., PLoS ONE, 2014
[8] M. Filosi et al., bioRxiv, 2014