Characterizing Protein Families of Unknown Function
Metagenomic Samples <br />All Proteins (7.3 M, 1900 genomes)<br />Morgan G.I. Langille and Jonathan A. Eisen<br />Genome Center, University of California, Davis, CA<br />Abstract<br />One of the most frustrating aspects of genomic and metagenomic analysis is that functional predictions cannot be made for many genes. These "hypothetical" or "unknown" genes represent a significant fraction of the genes in most genomes or metagenomes, and this fraction will likely increase as sequencing technology continues to outpace functionally informative lab experiments. To start tackling this situation, we characterized and ranked protein families of unknown function from completed genome sequences. The families were ranked using several metrics, such as number of members, presence across the tree of life, and presence mostly in pathogens or in species from particular habitats. This ranking allows particular families of unknown function to be targeted for more in-depth analysis because of their ubiquitous nature or their role in a particular niche. In addition to ranking these families, we analyze their abundance profiles across several metagenomic studies and cluster them with families of known function in the hope of making novel functional predictions. <br />No Informative Product Annotation** <br />(2.7 M)<br />“Unknown” Genes <br />(1.8 M)<br />No PFam Hits* <br />(2.2 M)<br />Rank by size (# of proteins in the family) <br /><ul><li>10 unknown families > 1000 sequences
40 unknown families > 500 sequences</li></ul>Rank by universality (% presence across all species) <br /><ul><li>26 unknown families > 10% in Bacteria and Eukaryotes
6 unknown families > 10% in all 3 domains</li></ul>Rank by family presence in pathogenic species<br /><ul><li>75 unknown families (size > 100) with > 80% of their proteins found only in species listed as pathogens</li></ul>Rank by family presence in particular habitats (e.g. aquatic)<br /><ul><li>12 unknown families (size > 50) with > 80% of their proteins found only in species with an aquatic habitat</li></ul>3.1 Hypothesis for “Community Profiling”<br /><ul><li>We hypothesize that protein families with similar abundance profiles across many diverse metagenomic samples have similar functions. If so, novel predictions can be made for families of unknown function whose profiles resemble those of families with known function.</li></ul>3.2 Current Approach for “Community Profiling”<br />Conclusions<br />Objectives<br />1. Identify families of genes with unknown function<br />2. Characterize and rank families of unknown function<br />3. Produce novel predictions using metagenomics data<br />Identify families of genes with unknown function<br />Characterize and rank families with unknown function<br />Produce novel predictions using metagenomics data<br />1.1 Data Source<br /><ul><li>All genomes were downloaded from IMG in Dec. 2009.
1895 genomes (Eukaryota, Bacteria, and Archaea)
*Samples are normalized by dividing by the sum of each column. Subtraction of taxonomic signal is planned in the near future.
** Distances between protein families were calculated using Pearson’s correlation and the Sørensen similarity index.
Additional metagenomic samples are being added for improved resolution between protein family profiles.</li></ul>1.2 Identify proteins with unknown function<br />HMMER 3<br />CAMERA’s <br />“All metagenomic proteins”<br />Sanger Seq. Mostly GOS.<br />43 M proteins, 102 Samples<br />All PFAMs<br />(11,000)<br />3.3 Preliminary Results<br />Count PFAM hits per sample<br />Identify similar protein family profiles (using R & Cytoscape)<br /><ul><li>*Pfam hits were determined using HMMER 3 and Pfam-A version 24. Non-informative PFams such as DUFs (Domains of Unknown Function) and UPFs (Uncharacterized Protein Families) were not counted.
**Proteins with no informative product annotation were identified by searching the protein product descriptions (e.g. "hypothetical protein", "predicted protein", etc.).
Taking the intersection of the proteins flagged by these two methods leaves 1.8 million proteins with no known function. Thus, about 25% of all proteins from completed genomes (1.8 M of 7.3 M) are of unknown function. </li></ul>Calculate distances between protein family profiles **<br />Normalize Samples*<br />1.3 Protein families with unknown function<br /><ul><li>Protein families were obtained from IMG (“IMG Ortholog Clusters”), and those families in which >70% of the members have unknown function (see 1.2) were labelled as “families of unknown function”.
144,492 families of unknown function (size > 1)
PFam hits to proteins from 102 metagenomic samples (see 3.1), shown as a heat map with hierarchical clustering
PFams with metagenomic profiles (see 3.1) having correlations > 0.9 were visualized using Cytoscape</li></ul>3.4 Future Work<br /><ul><li>Various normalization methods, distance calculations, and clustering methods are being investigated in order to maximize the clustering of protein families with similar function.
Additional metagenomic samples are being screened against PFams to provide better resolution for those with similar profiles.
Extension to protein families other than PFams is planned in the near future.
In this study we created a list of protein families that have no known function. These protein families were then ranked on various criteria that allow researchers to identify particular genes of unknown function that are of high interest due to their presence across the tree of life, their possible role in pathogenesis, or their contribution to species in particular environments.
Secondly, we have started developing a novel method for predicting gene function that does not rely on sequence similarity and should, in theory, improve as more metagenomic datasets become available. This method could provide insight into the function of the vast number of proteins that we currently cannot annotate.</li>
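The community-profiling steps described in 3.2 (count hits per sample, normalize samples by column sum, calculate distances between family profiles, keep pairs with correlation > 0.9) can be illustrated with a minimal Python sketch. This is not the poster's implementation, which used R and Cytoscape. The function names, the family labels `PF_A`, `PF_B`, `PF_C`, and the toy counts are all hypothetical. The sketch also assumes that each column is one metagenomic sample and that the Sørensen index is computed on presence/absence.

```python
from itertools import combinations
from math import sqrt


def normalize_columns(counts):
    """Divide every count by the total of its sample (column)."""
    families = list(counts)
    n_samples = len(counts[families[0]])
    totals = [sum(counts[f][j] for f in families) for j in range(n_samples)]
    return {
        f: [counts[f][j] / totals[j] if totals[j] else 0.0
            for j in range(n_samples)]
        for f in families
    }


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / sqrt(vx * vy)


def sorensen(x, y):
    """Sorensen index on presence/absence: 2|A & B| / (|A| + |B|)."""
    px = {j for j, v in enumerate(x) if v > 0}
    py = {j for j, v in enumerate(y) if v > 0}
    if not px and not py:
        return 0.0  # assumption: two empty profiles share nothing
    return 2 * len(px & py) / (len(px) + len(py))


def similar_pairs(profiles, threshold=0.9):
    """Return (family_a, family_b, r) for pairs with Pearson r > threshold."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical hit counts: one row per family, one column per sample.
toy_counts = {
    "PF_A": [10, 20, 0, 40],
    "PF_B": [5, 10, 0, 20],
    "PF_C": [30, 0, 25, 1],
}

profiles = normalize_columns(toy_counts)
edges = similar_pairs(profiles, threshold=0.9)
for a, b, r in edges:
    print(a, b, round(r, 3), round(sorensen(profiles[a], profiles[b]), 3))
```

In this toy example only `PF_A` and `PF_B` pass the threshold, because their normalized profiles are proportional to each other. Pairs like these could be exported as an edge list for a network viewer such as Cytoscape.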