Bc2012 submission 109a

Published in: Technology
  • For the purpose of making the Representative Proteome Groups, we created a list of organisms ranked by mean co-membership, meaning that the average connectedness to all other organisms was computed. The list was then used to recruit organisms based on whether or not the co-membership score was greater than the CMT. We did some tests and found that varying the ranking system changed no more than 2% of the RPGs, and none of the RPs (meaning that in those cases, one organism “jumped ship” into another group).
  • To ensure that the RPs are hierarchical across CMTs, we first calculate RP75, then use that set to generate RP55, and so on. Any RP in a smaller set (such as RP15) will thus be found in all larger sets (such as RP75). The genus change of Vibrio salmonicida to Aliivibrio illustrates why using taxonomy as the mechanism for determining RPGs is a bad idea: taxonomy can change, but the sequences won’t.
  • Stability of RPs is illustrated by a retrospective study. Using a UniProt release from 2004, there were 116 RPs. All 116 remained RPs throughout the test period. Furthermore, any RP added later (in 2005, 2006, etc.) remained an RP in subsequent releases. Comparing the number of RPs with the number of complete proteomes shows that RPs are growing at a slower rate. The % reduction of sequence space is indicated by the red and purple lines, which show a steady increase in the reduction rate. This rate depends on the type of genomes being sequenced: if many related genomes (yet another E. coli, for example) are added, then the reduction increases (see 2006_7).
  • The InterPro coverage stats indicate that the RP set covers most of the sequence space in terms of similarity families. They also indicate that there is not much significant over- or under-representation of sequences. The missing InterPro families tended to be viral (which are not yet included in RP) or lineage-specific. Experimentally-characterized proteins were determined using literature references in Swiss-Prot entries or GOA evidence codes. There are 30,000 such proteins in the examined set.
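The ranking-and-recruiting procedure described in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PIR pipeline: the co-membership formula (shared UniRef50 clusters, Dice-style, as a percentage), the descending ranking order, the function names, and the toy cluster IDs are all hypothetical. Only the organism codes come from the slides.

```python
from statistics import mean


def co_membership(clusters_a, clusters_b):
    """ASSUMED formula: percentage of UniRef50 clusters shared by two proteomes."""
    total = len(clusters_a) + len(clusters_b)
    if total == 0:
        return 0.0
    return 100.0 * 2 * len(clusters_a & clusters_b) / total


def build_rpgs(proteomes, cmt):
    """Greedy RPG construction for one co-membership threshold (CMT).

    proteomes: dict mapping a proteome code to its set of cluster IDs.
    Returns a list of groups; each group starts with its seed proteome.
    """
    names = list(proteomes)
    # Pair-wise co-membership X for all proteomes.
    x = {a: {b: co_membership(proteomes[a], proteomes[b])
             for b in names if b != a}
         for a in names}
    # Rank by mean co-membership (descending order is an assumption).
    ranked = sorted(names,
                    key=lambda a: mean(x[a].values()) if x[a] else 0.0,
                    reverse=True)
    groups = []
    while ranked:  # "ranked list empty?" check from the flowchart
        seed = ranked[0]
        group = [seed] + [p for p in ranked[1:] if x[seed][p] >= cmt]
        groups.append(group)
        ranked = [p for p in ranked if p not in group]
    return groups


# Toy data: cluster IDs are invented for illustration.
toy = {
    "ECOLI": {1, 2, 3, 4},
    "SHIFL": {1, 2, 3, 5},
    "AGRT5": {6, 7, 8, 9},
    "AGRSH": {6, 7, 8, 10},
}
print(build_rpgs(toy, 55))  # [['ECOLI', 'SHIFL'], ['AGRT5', 'AGRSH']]
```

The sketch stops at group formation. Choosing the representative within each group is left out because the slides state that this step is inspected manually by a curator.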
  • Bc2012 submission 109a

    1. Representative Proteomes and Genomes: A standardized, stable and unbiased set of proteomes and genomes
       http://pir.georgetown.edu/rps/
       Raja Mazumder (mazumder@gwu.edu)
    2. Nomenclature
       • Representative Proteomes
         • Primarily computational
       • Reference Proteomes/QFO Proteomes/Blessed Proteomes
         • Primarily manual
       • Extension of Reference proteomes
    3. Representative Proteomes (RP) and Representative Genomes (RG)
       http://pir.georgetown.edu/rps/
    4. Procedure to generate the RPs (flowchart)
       • Compute the pair-wise co-membership value (X) in UniRef50 for all proteomes
       • For each proteome, compute the mean co-membership between this proteome and the other proteomes
       • Create a ranked proteome list based on the mean co-membership
       • RPG generation starts
       • For a given CMT, take the first proteome in the ranked list and the ones with X ≥ CMT to form an RPG, and remove them from the ranked list
       • Ranked list empty? If no, repeat the previous step; if yes, RPG generation ends
       • Select a Representative Proteome for each RPG (manually inspected by curator)
    5. UniRef50 (diagram labels: Proteome A, Proteome B, UniRef50)
       • Sequence clusters (UniRef100, 90, 50)
       • From any organism
       • Part of UniProt production cycle
       • PMID: 17379688
    6. RPs at Different CMTs (chart)
       • X axis: CMT (%), at 75, 55, 35 and 15
       • Left axis: # RPs (0 to 1000); right axis: % (0 to 100)
       • Series: # RPs; % Reduction - Proteomes (1); % Reduction - Sequences (2)
       • Data labels (million proteins): 3.02, 2.69, 2.36, 2.02
       • (1) Based on 1144 complete genomes
       • (2) Based on 4.3 million sequences (complete genomes only)
       • UniProtKB total: 13.46 million sequences
    7. The RP at a higher level is used to cluster the lower levels; RPGs are constructed based on co-membership, not taxonomy
    8. Manual mapping of UniProt and NCBI genomes
       • The taxonomy ID of each proteome present in UniProt is mapped to the NCBI RefSeq/GenBank genome project IDs
       • When more than one genome is available for the same taxonomy ID, the genomes are ranked according to the availability of a RefSeq genome, number of related publications, number of citations for each publication, and date of sequencing. The highest-ranking genome is mapped to the UniProtKB proteome
    9. RefSeq genomes and proteomes
       • Mapping allows us to retrieve genomes and proteomes from RefSeq
       • ftp://ftp.pir.georgetown.edu/databases/rps/rg
       • ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/
    10. RP55 Over Time (chart)
       • X axis: UniProtKB release (2004_1, 2005_4, 2006_7, 2007_10, 2008_13, 2009_15, 2010_09)
       • Left axis: # complete proteomes (0 to 1400); right axis: % (0 to 120)
       • Series: # complete proteomes; # RPGs; % species in multiple RPGs; % stable RPGs
    11. Coverage Statistics – RP55
       • 95% of all InterPro families contain at least one protein from the RP set
       • InterPro covers ~75% of all proteins in UniProtKB, and this number holds true as well for RP55
       • 93% of the experimentally-characterized proteins are retained in the RP set
    12. Coverage and use
    13. Downloads
    14. Make your own set
    15. Visualizing all-against-all proteome correlation matrix vs. the taxonomy tree
       • Developed a method to graphically visualize NCBI's taxonomy tree and overlay the proteome correlation tree (PCT) to illustrate genomic similarity between organisms that may otherwise be considered to have distant ancestry
       • Computed all-against-all correlation values between all complete proteomes
       • The comparison network can be browsed in Cytoscape network software to easily identify nodes in the taxonomy tree that are not supported by PCT data
       • Development tools: CytoscapeWeb, CytoscapeRPC, Perl
    16. Taxonomy tree vs. correlation table (diagram)
       • Family: Enterobacteriaceae
       • Genus: Shigella, Escherichia, Enterobacter, Klebsiella
       • Species: ENT38, ECOLI, ESCF3, ECO24, ENTAK, ECOK1, KLEP7, SHIFL, SHIF8
       • Family-to-genus and genus-to-species links: distance based on taxonomy tree
       • Links between species: distance based on correlation table
    17. Example: Examine correlation scores of AGRT5
       • Agrobacterium tumefaciens (AGRT5)
       • http://pir.georgetown.edu/cgi-bin/rps_tree.pl?point_id=r15p176299&on=1&on100=1&file_id=122063&p=#-5
    18. Can easily identify genomic neighbors
       • The top 2 levels are family and genus nodes arranged according to taxonomic position
       • The bottom nodes are complete proteomes with a heuristic force-directed layout applied according to all-against-all correlation
       • Although AGRT5 and AGRVS share the same genus, they are relatively distant from each other (~28%), compared to AGRT5 and AGRSH (~70%)
    19. Sequence search
       • Cleaner BLAST/phmmer results
    20. Conclusions
       • High-quality RPs generated computationally and inspected by curators
       • A standardized, stable and unbiased set of proteomes and genomes
       • Completely integrated into the UniProt/UniRef production pipeline, with monthly releases
       • Automatically selects QFO/UniProt RF (if available in RPG) as RP (provide feedback to QFO and others if discrepancy)
       • Extended to RefSeq (ftp://ftp.pir.georgetown.edu/databases/rps/rg; ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/)
       • RGs can help placement of unknown metagenomic sequences into the correct clusters
    21. Acknowledgements
       • Chuming Chen (PIR)
       • Darren Natale (PIR)
       • Hongzhan Huang (PIR)
       • Jian Zhang (PIR)
       • Peter McGarvey (PIR)
       • Cathy Wu (PIR)
       • Mona Motwani (GWU)
       • Jamal Theodore (GWU)
       • Robert Finn (HHMI Janelia Farm/Pfam)
       • Eleanor Stanley (EBI)
       • Kim Pruitt (NCBI)
       • Yuri Wolf (NCBI)
       • UniProt Consortium
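
The genome-ranking rule on the "Manual mapping of UniProt and NCBI genomes" slide (RefSeq availability, then publications, then citations, then sequencing date) maps naturally onto a tuple sort key. The sketch below is a hedged illustration only: the `GenomeCandidate` fields, the use of a single total citation count, the preference for more recent sequencing dates, and all sample values are assumptions, since the slides give the criteria but not their exact form.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class GenomeCandidate:
    project_id: str      # hypothetical genome project identifier
    has_refseq: bool     # is a RefSeq genome available?
    n_publications: int  # number of related publications
    n_citations: int     # ASSUMED: total citations across those publications
    sequenced: date      # date of sequencing


def pick_genome(candidates):
    """Return the highest-ranking genome for one taxonomy ID.

    Tuples compare element by element, so earlier criteria dominate
    later ones. Preferring the later date is an assumption.
    """
    return max(
        candidates,
        key=lambda g: (g.has_refseq, g.n_publications, g.n_citations, g.sequenced),
    )


# Invented sample data for a single taxonomy ID.
candidates = [
    GenomeCandidate("A", False, 10, 500, date(2005, 1, 1)),
    GenomeCandidate("B", True, 2, 30, date(2003, 1, 1)),
    GenomeCandidate("C", True, 2, 45, date(2001, 1, 1)),
]
best = pick_genome(candidates)
print(best.project_id)  # C
```

Candidate A loses despite its citation count because it has no RefSeq genome; B and C tie on publications, so citations decide between them.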
