Inferring microbial community function from taxonomic composition

Morgan G.I. Langille1,*, Jesse R.R. Zaneveld2, J Gregory Caporaso3, Joshua Reyes4, Dan Knights5, Daniel McDonald6, Rob Knight5, Robert G. Beiko1, Curtis Huttenhower4

1Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada; 2Dept. of Microbiology, Oregon State University, Corvallis, OR, USA; 3Dept. of Computer Science, Northern Arizona University, Flagstaff, AZ, USA; 4Dept. of Biostatistics, Harvard School of Public Health, Boston, MA, USA; 5Dept. of Computer Science, University of Colorado, Boulder, CO, USA; 6Biofrontiers Institute, University of Colorado, Boulder, CO, USA; *firstname.lastname@example.org

Abstract
It is often most efficient to characterize microbial communities using taxonomic markers such as the 16S ribosomal small subunit rRNA gene. The 16S gene is typically used to describe the organisms or taxonomic units present in a sample, but data from such markers do not inherently reveal the molecular functions or ecological roles of members of a microbial community. We have developed and validated a novel computational method that takes a set of observed taxonomic abundances and infers abundance profiles of enzymes and pathways from multiple functional classification schemes (KEGG, PFAM, COG, etc.). We use ancestral state reconstruction to determine approximate genomic content, taking into account 16S copy number and known functional abundance profiles from all currently available microbial genomes. We have evaluated the accuracy of this inference for different groups of taxa and for different areas of biological function. Our method, implemented as the PI-CRUST software (Phylogenetic Investigation of Communities by Reconstruction of Unobserved STates), allows 16S-based studies to be extended to predict the functional abilities of microbiomes, and allows expected and observed functions to be compared in shotgun-based metagenomic experiments.

1. PI-CRUST Software Pipeline

1.1 Starting Data Sources (Internally used by PI-CRUST)
• Entire GreenGenes 16S reference tree.
• A functional “Trait Table” for all completed genomes (e.g. KEGG, PFAM, etc.). This contains the abundance of each functional category for each genome in the IMG database.
• 16S copy number information for each completed genome in IMG (used to normalize OTU tables).
• A map from GreenGenes identifiers to IMG completed genomes (to link what is known about completed genomes to tips in the reference tree).

1.2 PI-CRUST: Genome Functional Predictions
[Flow diagram inputs: 16S Copy Number (completed genomes only); Genome Functional Table (completed genomes only)]
[Tree legend: Known functional composition (from sequenced genome); Inferred ancestral functional composition]

3. Genome Validation

3.1 Method
1) Remove a single genome from our reference dataset (pretending it has not been sequenced)
2) Use PI-CRUST to predict the functional abundances for our “unknown” genome using only its 16S gene
3) Compare PI-CRUST predictions vs. the known functional abundances of our genome
4) Repeat for all completed genomes (>2000)
5) Plot the distribution of accuracy values for each genome (3.2) or each functional group (3.3)

3.2 PI-CRUST accuracy for completed genomes
[Figure panels: “Using Various Ancestral State Reconstruction”; “Distance to nearest genome affects accuracy” (x-axis: 16S phylogenetic distance to nearest species; annotation: Endosymbionts & Reduced Genomes)]
• “Random”: Each functional abundance is drawn at random from its distribution across all genomes.
• “Nearest Neighbour”: The functional profile of the genome with the closest 16S distance is used.
• “PIC”: Ancestral state reconstruction using least squares regression (APE R package).
• “WAGNER”: Ancestral state reconstruction using Wagner parsimony (Count package).
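The leave-one-out procedure in 3.1 and the two simplest methods in 3.2 can be sketched as below. This is a minimal illustration, not the PI-CRUST implementation: the genome names, trait counts, and 16S distances are hypothetical toy data, and Pearson correlation is assumed as the accuracy measure because the poster does not state which metric was used. The “PIC” and “WAGNER” methods are omitted since they depend on the APE and Count packages.

```python
import random
from math import sqrt

# Hypothetical toy reference set: function counts per completed genome.
traits = {
    "A": [2, 0, 5, 1],
    "B": [2, 1, 5, 1],
    "C": [0, 4, 1, 3],
    "D": [0, 5, 1, 2],
}

# Hypothetical pairwise 16S distances, keyed by unordered genome pair.
distances = {
    frozenset(("A", "B")): 0.02, frozenset(("C", "D")): 0.03,
    frozenset(("A", "C")): 0.30, frozenset(("A", "D")): 0.31,
    frozenset(("B", "C")): 0.29, frozenset(("B", "D")): 0.30,
}

def nearest_neighbour(target, traits, distances):
    """Copy the profile of the closest genome, excluding the held-out target."""
    others = [g for g in traits if g != target]
    nearest = min(others, key=lambda g: distances[frozenset((target, g))])
    return list(traits[nearest])

def random_baseline(target, traits, rng):
    """Draw each function's count from its values in the other genomes."""
    others = [g for g in traits if g != target]
    n_functions = len(traits[target])
    return [traits[rng.choice(others)][i] for i in range(n_functions)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def leave_one_out(traits, predict):
    """Hold out each genome in turn and score its prediction (assumed metric)."""
    return {g: pearson(predict(g), traits[g]) for g in traits}

nn_scores = leave_one_out(
    traits, lambda g: nearest_neighbour(g, traits, distances))
rng = random.Random(0)
random_scores = leave_one_out(
    traits, lambda g: random_baseline(g, traits, rng))
```

In this toy set each genome has one close relative, so the nearest-neighbour scores are high; with real data the score would be expected to fall as the 16S distance to the nearest sequenced genome grows.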
3.3 PI-CRUST accuracy for various functional groups
[Figure: x-axis “PI-CRUST Accuracy (for each SEED function)”]
The ability to predict functions from 16S varies depending on the functional class. Functions that are well conserved and evolve similarly to 16S, such as “RNA metabolism” and “Cell Division and Cell Cycle”, are predicted with higher accuracy. Groups that tend not to be inherited by vertical descent, such as “Phages, Prophages, Transposable Elements, Plasmids”, are predicted less accurately.

[Tree legend: Predicted functional composition (for unsequenced genome)]
[Flow diagram for 1.2: Reference 16S Tree (GreenGenes); steps “Prune taxa with no genome information”, “Infer ancestral functional compositions”, “Predict genome traits”; outputs: 16S Copy Number Predictions, Functional Trait Predictions]

1.3 User Input
• “OTU table”: number of OTUs (with GreenGenes identifiers) per sample

1.4 PI-CRUST: Metagenome Functional Predictions
[Flow diagram: 16S Copy Number Predictions and OTU Table give a Normalized OTU Table; Functional Trait Predictions and the Normalized OTU Table give Metagenome Functional Predictions]

2. Metagenome Validation

2.1 Method
1) Obtain microbiome samples with both whole metagenomic and 16S sequencing
2) Use PI-CRUST with 16S data to predict functions for samples
3) Compare PI-CRUST predictions with functions observed from metagenomic sequencing

2.2 PI-CRUST accuracy on HMP samples
[Figure: axis “PI-CRUST predicted abundance based on 16S data”. Each point represents the predicted vs. observed relative abundance for a single KEGG category.]

4. Concluding Remarks

4.1 Discussion
• Genome content has been shown in the past to vary widely even in closely related species. However, this may not be typical for the majority of bacterial and archaeal species. Our ability to predict the functions encoded in an organism based solely on its 16S gene and knowledge from the thousands of completed genomes suggests that gene content is often well correlated with 16S phylogeny.
• PI-CRUST allows 16S-only studies to be expanded to include information about functional abundances.
• Studies with full metagenomic sequencing can use PI-CRUST to identify functions that are observed but not expected based on their 16S profiles (i.e., the taxa that are present in the sample).

4.2 Availability & Future Plans
• PI-CRUST is still under development but will be freely available under the GPL at: http://picrust.sourceforge.net
• Various methods of ancestral state reconstruction and confidence weighting are still being evaluated.
• Evaluation of PI-CRUST on other paired metagenomic and 16S datasets is underway.

Acknowledgements
• MGIL is the recipient of an IHMC travel award funded by the NIH.
• MGIL and RGB are supported by a CIHR emerging team grant.
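The metagenome prediction step outlined in 1.4 could be sketched roughly as follows. The poster only names the inputs and outputs of this step, so the arithmetic here is an assumption: each OTU count is divided by its predicted 16S copy number, and the normalized abundance is then multiplied by that OTU's predicted function counts and summed per sample. The sample name, OTU identifiers, and function identifiers are hypothetical placeholders.

```python
# Hypothetical user input: OTU counts per sample.
otu_table = {"sample1": {"otu1": 10, "otu2": 6}}

# Hypothetical predicted 16S copy number for each OTU.
copy_number = {"otu1": 5, "otu2": 2}

# Hypothetical predicted gene copies per function for each OTU.
trait_predictions = {
    "otu1": {"K00001": 1, "K00002": 0},
    "otu2": {"K00001": 2, "K00002": 3},
}

def normalize_otu_table(otu_table, copy_number):
    """Divide each OTU count by that OTU's predicted 16S copy number."""
    return {
        sample: {otu: count / copy_number[otu] for otu, count in counts.items()}
        for sample, counts in otu_table.items()
    }

def predict_metagenome(normalized, trait_predictions):
    """Sum predicted function counts over OTUs, weighted by abundance.

    Assumed combination rule: abundance times predicted gene copies.
    """
    result = {}
    for sample, counts in normalized.items():
        totals = {}
        for otu, abundance in counts.items():
            for function, copies in trait_predictions[otu].items():
                totals[function] = totals.get(function, 0.0) + abundance * copies
        result[sample] = totals
    return result

normalized = normalize_otu_table(otu_table, copy_number)
metagenome = predict_metagenome(normalized, trait_predictions)
```

Normalizing first matters because an organism with many 16S copies would otherwise be over-counted: here otu1 has more reads than otu2 but fewer estimated cells once copy number is taken into account.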