Representative Proteomes
and Genomes
A standardized, stable and unbiased set of proteomes and
genomes
http://pir.georgetown.edu/rps/




Raja Mazumder (mazumder@gwu.edu)
Nomenclature

   Representative Proteomes
       Primarily computational
   Reference Proteomes/QFO
    Proteomes/Blessed Proteomes
       Primarily manual
   Extension of Reference proteomes
Representative Proteomes (RP) and Representative
Genomes (RG)
http://pir.georgetown.edu/rps/
Procedure to generate the RPs
   1. Compute the pair-wise co-membership value (X) in UniRef50 for all proteomes.
   2. For each proteome, compute the mean co-membership between this proteome and the other proteomes.
   3. Create a ranked proteome list based on the mean co-membership.
   4. RPG generation: for a given co-membership threshold (CMT), take the first proteome in the ranked list and the ones with X ≥ CMT to form a Representative Proteome Group (RPG), and remove them from the ranked list.
   5. Repeat step 4 until the ranked list is empty; RPG generation then ends.
   6. Select a Representative Proteome for each RPG (manually inspected by a curator).
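
The grouping loop in this procedure can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the function name `build_rpgs`, the dictionary layout, and the toy scores are all made up for the example, and ranking from highest to lowest mean co-membership is an assumption (the slide only says the list is ranked by the mean).

```python
from statistics import mean


def build_rpgs(comembership, cmt):
    """Greedy RPG construction from a symmetric pairwise co-membership table.

    comembership: dict mapping frozenset({a, b}) -> X (percent).
    cmt: co-membership threshold (percent).
    Returns a list of groups; the first member of each group is its seed.
    """
    proteomes = sorted({p for pair in comembership for p in pair})

    def x(a, b):
        return comembership.get(frozenset((a, b)), 0.0)

    # Mean co-membership of each proteome with all the others.
    mean_x = {p: mean(x(p, q) for q in proteomes if q != p) for p in proteomes}

    # Assumption: highest mean first; ties broken alphabetically.
    ranked = sorted(proteomes, key=lambda p: (-mean_x[p], p))

    groups = []
    while ranked:
        seed = ranked[0]
        members = [seed] + [p for p in ranked[1:] if x(seed, p) >= cmt]
        groups.append(members)
        ranked = [p for p in ranked if p not in members]
    return groups


# Toy scores, invented for illustration only.
toy = {
    frozenset(("ECOLI", "SHIFL")): 80.0,
    frozenset(("ECOLI", "KLEP7")): 60.0,
    frozenset(("SHIFL", "KLEP7")): 58.0,
    frozenset(("ECOLI", "AGRT5")): 10.0,
    frozenset(("SHIFL", "AGRT5")): 9.0,
    frozenset(("KLEP7", "AGRT5")): 11.0,
}

rpgs_75 = build_rpgs(toy, 75)
rpgs_55 = build_rpgs(toy, 55)
```

With these toy numbers, lowering the CMT from 75 to 55 merges KLEP7 into the ECOLI group, which is the behaviour the flowchart describes. Curator selection of the final RP within each group is not modelled here.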
UniRef50
[Figure: Proteome A, Proteome B, UniRef50]
   Sequence clusters (UniRef100, 90, 50)
   From any organism
   Part of UniProt production cycle
   PMID: 17379688
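
The slides do not spell out how the co-membership value X is calculated. As a rough sketch only, one plausible reading is "the percentage of proteins from the two proteomes that sit in UniRef50 clusters containing members of both". The function name, the input layout, and the formula itself are assumptions made for illustration, and the cluster identifiers are invented.

```python
def comembership_percent(clusters_a, clusters_b):
    """Assumed definition of X, for illustration only.

    clusters_a, clusters_b: dict mapping protein id -> UniRef50 cluster id
    for proteome A and proteome B.
    Returns the percentage of proteins (from both proteomes) that belong
    to a cluster shared by A and B.
    """
    shared = set(clusters_a.values()) & set(clusters_b.values())
    in_shared = sum(c in shared for c in clusters_a.values())
    in_shared += sum(c in shared for c in clusters_b.values())
    total = len(clusters_a) + len(clusters_b)
    return 100.0 * in_shared / total if total else 0.0


# Hypothetical protein and cluster ids.
proteome_a = {"a1": "U1", "a2": "U1", "a3": "U2", "a4": "U3"}
proteome_b = {"b1": "U1", "b2": "U2", "b3": "U4", "b4": "U5"}

x_ab = comembership_percent(proteome_a, proteome_b)
```

Here five of the eight proteins fall in the shared clusters U1 and U2. The measure is symmetric in A and B, which is what the pair-wise table in the procedure requires.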
RPs at Different CMTs
[Figure: # RPs (left axis, 0 to 1000) and % reduction (right axis, 0 to 100) plotted against CMT (%) at 75, 55, 35 and 15. Series: # RPs; million proteins (data labels, left to right: 3.02, 2.69, 2.36, 2.02); % reduction in proteomes (1); % reduction in sequences (2).]
(1) Based on 1144 complete genomes
(2) Based on 4.3 million sequences (complete genomes only)
UniProtKB total: 13.46 million sequences
• The RP set at a higher CMT is used to cluster the lower levels (RP75 is
  computed first and then used to generate RP55, and so on)
• RPGs are constructed based on co-membership, not
  taxonomy
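
A small sketch can show why feeding each level's RPs into the next level makes the sets nested. The helper names, the toy scores, and the shortcut of using each group's seed as its representative are assumptions; in the described workflow the representative is inspected by a curator.

```python
def group(ranked, x, cmt):
    """Greedy grouping: the first item seeds a group of everything with X >= cmt."""
    groups, remaining = [], list(ranked)
    while remaining:
        seed = remaining[0]
        members = [seed] + [p for p in remaining[1:] if x(seed, p) >= cmt]
        groups.append(members)
        remaining = [p for p in remaining if p not in members]
    return groups


def hierarchical_rps(ranked, x, cmts=(75, 55, 35, 15)):
    """Return {cmt: representatives}, feeding each level's RPs into the next."""
    levels, current = {}, list(ranked)
    for cmt in cmts:
        # Assumption: the seed of each group stands in as its representative.
        current = [g[0] for g in group(current, x, cmt)]
        levels[cmt] = current
    return levels


# Toy scores, invented for illustration only.
table = {
    frozenset(("ECOLI", "SHIFL")): 80.0,
    frozenset(("ECOLI", "KLEP7")): 60.0,
    frozenset(("SHIFL", "KLEP7")): 58.0,
    frozenset(("ECOLI", "AGRT5")): 10.0,
    frozenset(("SHIFL", "AGRT5")): 9.0,
    frozenset(("KLEP7", "AGRT5")): 11.0,
}


def x(a, b):
    return table[frozenset((a, b))]


levels = hierarchical_rps(["ECOLI", "SHIFL", "KLEP7", "AGRT5"], x)
```

Because every level only ever selects from the previous level's representatives, any RP at CMT 15 is guaranteed to appear at CMT 35, 55 and 75 as well.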
Manual mapping of UniProt and
NCBI genomes
   The taxonomy ID of each proteome present
    in UniProt is mapped to the NCBI
    RefSeq/GenBank genome project IDs
   When more than one genome is available
    for the same taxonomy ID, the genomes are
    ranked according to the availability of a
    RefSeq genome, number of related
    publications, number of citations for each
    publication, and date of sequencing.
   The highest ranking genome is mapped to
    the UniProtKB proteome
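
The ranking rule above can be expressed as a sort key. This is only a sketch of the idea: the record fields and project identifiers are hypothetical, and it assumes the criteria are applied in the order listed, that citations are summed per genome, and that a more recent sequencing date ranks higher (the slide does not state the direction).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class GenomeCandidate:
    project_id: str      # hypothetical genome project identifier
    in_refseq: bool      # is a RefSeq genome available?
    n_publications: int  # number of related publications
    n_citations: int     # assumption: citations summed over publications
    sequenced: date      # date of sequencing


def rank_key(g):
    # Assumptions: criteria compared in the listed order; larger is better;
    # a more recent sequencing date wins the final tie-break.
    return (g.in_refseq, g.n_publications, g.n_citations, g.sequenced)


def pick_genome(candidates):
    """Return the highest-ranking genome for one taxonomy ID."""
    return max(candidates, key=rank_key)


candidates = [
    GenomeCandidate("P1", False, 5, 100, date(2009, 1, 1)),
    GenomeCandidate("P2", True, 2, 30, date(2005, 6, 1)),
    GenomeCandidate("P3", True, 2, 30, date(2008, 3, 15)),
]
best = pick_genome(candidates)
```

In this toy case P1 has more publications but no RefSeq genome, so it loses to P2 and P3; those two tie on publications and citations, and the date breaks the tie.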
RefSeq genomes and proteomes
      Mapping allows us to retrieve genomes and
       proteomes from RefSeq.




               ftp://ftp.pir.georgetown.edu/databases/rps/rg
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/
RP55 Over Time
[Figure: RP55 across UniProtKB releases 2004_1, 2005_4, 2006_7, 2007_10, 2008_13, 2009_15 and 2010_09. Left axis: # complete proteomes (0 to 1400); right axis: % (0 to 120). Series: # complete proteomes; # RPGs; % species in multiple RPGs; % stable RPGs.]
Coverage Statistics – RP55

   95% of all InterPro families contain at least
    one protein from the RP set
   InterPro covers ~75% of all proteins in
    UniProtKB, and this number holds true as
    well for RP55
   93% of the experimentally-characterized
    proteins are retained in the RP set
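
A family-coverage figure of this kind is a simple set computation. The sketch below shows one way to obtain it; the function name, the input layout, and the family and protein identifiers are placeholders, not real InterPro data.

```python
def family_coverage(family_members, rp_proteins):
    """Percent of families with at least one member in the RP set.

    family_members: dict mapping family id -> set of protein ids.
    rp_proteins: set of protein ids present in the RP set.
    """
    if not family_members:
        return 0.0
    hit = sum(1 for members in family_members.values() if members & rp_proteins)
    return 100.0 * hit / len(family_members)


# Placeholder ids, invented for illustration.
families = {
    "FAM1": {"p1", "p2"},
    "FAM2": {"p3"},
    "FAM3": {"p4", "p5"},
    "FAM4": {"p6"},
}
rp_set = {"p1", "p3", "p5"}

coverage = family_coverage(families, rp_set)
```

Three of the four toy families have a protein in the RP set; FAM4 plays the role of a family that is missed.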
Coverage and use
Downloads
Make your own set
Visualizing all-against-all proteome
correlation matrix vs. the taxonomy
tree
   Developed a method to graphically visualize NCBI's taxonomy
    tree and overlay the proteome correlation tree (PCT) to illustrate
    genomic similarity between organisms that may otherwise be
    considered to have distant ancestry.
   Computed all-against-all correlation values between all complete
    proteomes
   Comparison network can be browsed in Cytoscape network
    software to easily identify nodes in the taxonomy tree that are not
    supported by PCT data
   Development tools: CytoscapeWeb, CytoscapeRPC, Perl
[Figure: example for the family Enterobacteriaceae. The family and genus nodes (Shigella, Escherichia, Enterobacter, Klebsiella) are placed by distance based on the taxonomy tree; the species-level proteomes (ECOLI, ESCF3, ECO24, ECOK1, SHIFL, SHIF8, ENT38, ENTAK, KLEP7) are placed by distance based on the correlation table.]
Example: Examine correlation scores of
AGRT5

        Agrobacterium tumefaciens (AGRT5)




  http://pir.georgetown.edu/cgi-bin/rps_tree.pl?point_id=r15p176299&on=1&on100=1&file_id=122063&p=#-5
Can easily identify genomic neighbors

   The top 2 levels are family
    and genus nodes arranged
    according to taxonomic
    position
   The bottom nodes are
    complete proteomes with a
    heuristic force-directed
    layout applied according to
    all-against-all correlation
   Although AGRT5 and
    AGRVS share the same
    genus, they are relatively
    distant from each other
    (~28%), compared to
    AGRT5 and AGRSH
    (~70%).
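
Outside the graphical view, the same neighbor lookup can be done directly on the correlation table. The following sketch assumes the table is held as a dictionary keyed by proteome pairs; the function name is hypothetical, and the two scores are the approximate values quoted on the slide.

```python
def neighbors(query, correlation):
    """Return (proteome, score) pairs for `query`, most similar first."""
    scores = {}
    for pair, score in correlation.items():
        if query in pair:
            (other,) = pair - {query}
            scores[other] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


correlation = {
    frozenset(("AGRT5", "AGRSH")): 70.0,  # approximate value from the slide
    frozenset(("AGRT5", "AGRVS")): 28.0,  # approximate value from the slide
}

agrt5_neighbors = neighbors("AGRT5", correlation)
```

Sorting by score puts AGRSH ahead of AGRVS for AGRT5, matching the observation that sharing a genus does not guarantee a close proteome.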
Sequence search
   Cleaner BLAST/phmmer results
Conclusions
   High quality RPs generated computationally and
    inspected by curators
   A standardized, stable and unbiased set of proteomes
    and genomes
   Completely integrated into the UniProt/UniRef
    production pipeline, with monthly releases
   Automatically selects the QFO/UniProt RF (if available in the
    RPG) as the RP (feedback is provided to QFO and others in case of
    discrepancy)
   Extended to RefSeq
    (ftp://ftp.pir.georgetown.edu/databases/rps/rg;
    ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/)
   RGs can help placement of unknown metagenomic
    sequences into the correct clusters
Acknowledgements
             Chuming Chen (PIR)
             Darren Natale (PIR)
           Hongzhan Huang (PIR)
              Jian Zhang (PIR)
            Peter McGarvey (PIR)
               Cathy Wu (PIR)
            Mona Motwani (GWU)
           Jamal Theodore (GWU)
    Robert Finn (HHMI Janelia Farm/Pfam)
            Eleanor Stanley (EBI)
              Kim Pruitt (NCBI)
               Yuri Wolf (NCBI)
             UniProt Consortium

Bc2012 submission 109a


Editor's Notes

  1. For the purpose of making the Representative Proteome Groups, we created a list of organisms ranked by mean co-membership, meaning that the average connectedness to all other organisms was computed. The list was then used to recruit organisms based on whether or not the co-membership score was at or above the CMT. We did some tests and found that varying the ranking system changed no more than 2% of the RPGs, and none of the RPs (meaning that in those cases, one organism “jumped ship” into another group).
  2. To ensure that the RPs are hierarchical at different CMTs, we first calculate RP75, then take that set to generate RP55, and so on. Any RP in a smaller set (such as RP15) will thus be found in all larger sets (such as RP75). The genus change of Vibrio salmonicida to Aliivibrio illustrates why using taxonomy as the mechanism for determining RPGs is a bad idea: taxonomy can change, but the sequences won’t.
  3. Stability of RPs is illustrated by a retrospective study. Using a UniProt release from 2004, there were 116 RPs. All 116 remained RPs throughout the test period. Furthermore, any RP added later (in 2005, 2006, etc.) remained an RP in subsequent releases. The comparison of the number of RPs and the number of complete proteomes illustrates that RPs are growing at a slower rate. The % reduction of sequence space is indicated by red and purple lines, which show that there is a steady increase in the reduction rate. This rate is dependent upon the type of genomes being sequenced: if many related genomes (yet another E. coli, for example) are added, then the reduction increases (see 2006_7).
  4. The InterPro coverage stats indicate that the RP set covers most of the sequence space in terms of similarity families. They also indicate that there is not much significant over- or under-representation of sequences. The missing InterPro families tended to be viral (which are not yet included in RP) or lineage-specific. Experimentally-characterized proteins were determined using literature references in Swiss-Prot entries or GOA evidence codes. There are 30,000 such proteins in the examined set.
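
The retrospective stability check described in note 3 amounts to asking whether an RP, once selected, is present in every later release. The sketch below is one hedged way to compute that; the function name and data layout are assumptions, the release labels are taken from the chart, and the RP names are placeholders.

```python
def stable_percent(rp_by_release):
    """Percent of RPs that, once present, appear in every later release.

    rp_by_release: list of (release_label, set_of_RPs) in chronological order.
    """
    seen, dropped = set(), set()
    for _, rps in rp_by_release:
        dropped |= seen - rps  # RPs seen earlier but missing from this release
        seen |= rps
    return 100.0 * (len(seen) - len(dropped)) / len(seen) if seen else 100.0


# Placeholder RP names; release labels as on the chart.
history = [
    ("2004_1", {"A", "B"}),
    ("2005_4", {"A", "B", "C"}),
    ("2006_7", {"A", "B", "C", "D"}),
]
all_stable = stable_percent(history)

# A counter-example in which B is dropped in the second release.
with_drop = stable_percent([("2004_1", {"A", "B"}), ("2005_4", {"A", "C"})])
```

The first history mirrors the reported outcome (every RP persists), while the second shows how a dropped RP would lower the figure.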