• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Characterizing Protein Families of Unknown Function
 

Characterizing Protein Families of Unknown Function

on

  • 1,697 views

by Morgan G. I. Langille & Jonathan A. Eisen. This scientific poster was presented at the 18th Annual International Meeting on Microbial Genomics at Lake Arrowhead, California, USA. Sept. 12-16, ...

by Morgan G. I. Langille & Jonathan A. Eisen. This scientific poster was presented at the 18th Annual International Meeting on Microbial Genomics at Lake Arrowhead, California, USA. Sept. 12-16, 2010.

Statistics

Views

Total Views
1,697
Views on SlideShare
1,696
Embed Views
1

Actions

Likes
1
Downloads
23
Comments
0

1 Embed 1

http://www.slideshare.net 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Characterizing Protein Families of Unknown Function Characterizing Protein Families of Unknown Function Document Transcript

    • Metagenomic Samples
      Characterizing Protein Families of Unknown Function
      All Proteins (7.3 M, 1900 genomes)
      Morgan G.I. Langille and Jonathan A. Eisen
      Genome Center, University of California, Davis, CA
      Abstract
      Perhaps one of the most frustrating aspects of genomic and metagenomic analysis is that functional predictions for many genes cannot be identified. These "hypothetical" or "unknown" genes represent a significant fraction of the genes in most genomes or metagenomes, and this fraction will likely increase as sequencing technology continues to outpace functionally informative lab experiments. To start tackling this situation we characterized and ranked protein families with unknown function from completed genome sequences. Ranking of these families were done using several metrics such as quantity of members, presence across tree of life, presence in mostly pathogens or other habitats, etc. This ranking allows particular families of unknown function to be targeted for more in-depth analysis due to their ubiquitous nature or their role in a particular niche. In addition to ranking these families, we analyze their abundance profiles across several metagenomic studies and cluster them with families of known function in the hopes of making novel functional predictions.
      No Informative Product Annotation**
      (2.7 M)
      “Unknown” Genes
      (1.8 M)
      No PFam Hits*
      (2.2 M)
      Rank by size (# of proteins in the family)
      • 10 unknown families > 1000 sequences
      • 40 unknown families > 500 sequences
      Rank by universality (% presence across all species)
      • 26 unknown families > 10% in Bacateria and Eukaryotes
      • 6 unknown families > 10% in all 3 domains
      Rank by family presence in pathogenic species
      • 75 unknown families (size > 100) with > 80% proteins existing only in species listed as pathogens
      Rank by family presence in particular habitats (e.g. Aquatic)
      • 12 unknown families (size > 50) with > 80% proteins existing only in species with aquatic habitat
      3.1 Hypothesis for “Community Profiling”
      • By looking at the abundance profiles for all protein families across many diverse metagenomic samples we hypothesize that protein families with similar profiles will have similar function. If so, novel predictions can be made for families with unknown function that have similar profiles to known functional protein families.
      3.2 Current Approach for “Community Profiling”
      Conclusions
      Objectives
      1. Identify families of genes with unknown function
      2. Characterize and rank families of unknown function
      3. Produce novel predictions using metagenomics data
      Identify families of genes with unknown function
      Characterize and rank families with unknown function
      Produce novel predictions using metagenomics data
      1.1 Data Source
      • All genomes from IMG were downloaded Dec. 2009.
      • 1895 genomes (Eukaryota, Bacteria, and Archaea)
      • 7.3 M Proteins
      • *Samples are normalized by dividing by the sum of each column. Subtraction of taxonomic signal is planned in the near future.
      • ** Distance between protein families have been calculated using Pearson’s Correlation and Sørensen similarity index
      • Additional metagenomic samples are being added for improved resolution between protein family profiles.
      1.2 Identify proteins with unknown function
      HMMER 3
      CAMERA’s
      “All metagenomic proteins”
      Sanger Seq. Mostly GOS.
      43 M proteins, 102 Samples
      All PFAMs
      (11,000)
      3.3 Preliminary Results
      Count PFAM hitsper sample
      Identify similar protein family profiles (using R & Cytoscape)
      • *Pfam hits determined using HMMER 3 and Pfam-A version 24. Non-informative PFams such as DUFs (Domains of Unknown Function) and UPFs (Unidentified Protein Families) were not counted.
      • **No Informative Product Annotation identified by searching protein product description (e.g. "hypothetical protein", "predicted protein", etc. )
      • By taking the intersection of these two methods for identifying genes with no known function we are left with 1.8 million proteins. Thus, 25% of all proteins from completed genomes are of unknown function.
      Calculate distances between protein family profiles **
      Normalize Samples*
      1.3 Protein families with unknown function
      • Protein families were obtained from IMG (“IMG Ortholog Clusters”) and those families with >70% of their members having unknown function (see 1.2) were labelled as “families of unknown function”.
      • 144,492 familes with unknown function(size >1)
      • PFam hits to proteins from 102 metagenomic samples (see 3.1) shown as a heat map with hierarchal clustering
      • PFamswith metagenomic profiles (see 3.1) having correlations > 0.9 were visualized using Cytoscape
      3.4 Future Work
      • Various normalization methods, distance calculations, and clustering methods are being investigated in order to maximize the clustering of protein families with similar function.
      • Additional metagenomic samples are being screened against PFams to provide better resolution for those with similar profiles.
      • Extension to protein families other than PFams is planned in the near future.
      • In this study we created a list of protein families that have no known function. These protein families were then ranked on various criteria that would allow researchers to identify particular genes with unknown function that are a high level of interest due to their presence across the tree of life, their possible role in pathogenisis, or their contribution to species in particular environments.
      • Secondly, we have started development on a completely novel method that would predict gene function that does not use sequence similarity and would in theory improve as the number of metagenomic datasets become available over time. This method could help gain some insight into the function of the vast number of proteins that we currently can not annotate.