The annotation of plant proteins in UniProtKB


Event: Plant and Animal Genomes conference 2012
Speaker: Michel Schneider

The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records that are enhanced by information provided by automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. In order to provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for the import of protein sequences from TAIR has been developed. As most of the sequences in the complete genome reannotation proposed in the TAIR10 release are already in UniProtKB, these existing sequences have to be reconciled with those imported. Around 7% of them have a different gene model and should be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows identification of data that can be seamlessly transferred from one site to the other and the development of common annotations. With the significant increase in the number of complete genomes sequenced (the sequencing of 1001 Arabidopsis cultivars is currently under way!), organization of this data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB.
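The reconciliation step described in the abstract can be illustrated with a minimal Python sketch. It is not the actual UniProt or TAIR pipeline: the `Translation` record, the `reconcile` function, the locus names and the sequences are all hypothetical and only show the idea of comparing translations locus by locus, merging those that are 100% identical and setting aside those that differ for manual checking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """Hypothetical record: one protein translation attached to a gene locus."""
    locus: str      # gene locus identifier (invented values below)
    sequence: str   # amino-acid sequence (invented values below)


def reconcile(existing, imported):
    """Compare imported translations with existing ones, locus by locus.

    Returns three lists of loci:
      merge - imported sequence is 100% identical to an existing one
      check - locus is known but the sequence differs (manual check needed)
      new   - locus has no existing translation
    """
    known = {}
    for t in existing:
        known.setdefault(t.locus, set()).add(t.sequence)

    merge, check, new = [], [], []
    for t in imported:
        if t.locus not in known:
            new.append(t.locus)
        elif t.sequence in known[t.locus]:
            merge.append(t.locus)
        else:
            check.append(t.locus)
    return merge, check, new


# Toy data, invented for illustration only.
existing = [
    Translation("LOCUS_A", "MKTAYIAKQR"),
    Translation("LOCUS_B", "MSLLTEVETP"),
]
imported = [
    Translation("LOCUS_A", "MKTAYIAKQR"),    # identical: can be merged
    Translation("LOCUS_B", "MSLLTEVETPIR"),  # differs: different gene model?
    Translation("LOCUS_C", "MADEEKLPPG"),    # not yet known
]

merge, check, new = reconcile(existing, imported)
print("merge:", merge)
print("check manually:", check)
print("new:", new)
```

In this sketch, a sequence that differs for a known locus is only flagged. Deciding which gene model is correct remains a manual task, as the abstract states.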


  • Alignment of sequences deduced from 2 genomic DNAs, one cDNA and one EST. Annotation of erroneous gene model predictions.
  • Annotation of isoforms.
  • Information about how to reconstruct all isoforms. Access to the sequences of all isoforms. Can apply various tools.
  • The sequencing of 1001 Arabidopsis genomes is raising several questions and we have to find new solutions. If the sequences are not merged, one solution for BLAST is to use UniRef, but this is only valid for functional annotation and not for finding whether a homologous protein is already known in a given species.
  • The annotation of plant proteins in UniProtKB

    1. The annotation of Plant Proteins in UniProtKB. Michel Schneider, Plant protein annotation program, Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva, Switzerland. Michel.Schneider@isb-sib.ch
    2. (1) The UniProt consortium and its products; (2) Content of an entry in UniProtKB and manual curation; (3) Complete proteomes and reference proteomes; (4) Synchronization between UniProtKB and TAIR; (5) Some statistics. “Pioneers at the Heart of Science” 1998 – 2008. PAG XX, San Diego, January 15, 2012
    3. The UniProt consortium
    4. The missions of the UniProt consortium: provide the scientific community with a resource of protein sequence and functional annotation which has to be comprehensive, high quality and freely accessible.
    5. Four components to fulfill specific demands
        • UniProtKB, Protein Knowledgebase
            - UniProtKB/Swiss-Prot: Reviewed, manual curation (533’657 entries)
            - UniProtKB/TrEMBL: Unreviewed, automated annotation (19 million entries)
        • UniRef, sequence clusters: UniRef100, UniRef90, UniRef50
        • UniMES: metagenomic and environmental sample sequences
        • UniParc, sequence archive: contains current and obsolete sequences (29.6 million sequences)
    6. UniProtKB, the expertly curated component of UniProt. The high-quality curated protein knowledge database where data becomes structured knowledge.
    7. UniProtKB, the expertly curated component of UniProt. Shigeo Fukuda
    8. Protein sequence: One gene - One species. © 2009 SIB
    9. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
    10. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
        • Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
    11. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
        • General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…
        • Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
    12. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
        • General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…
        • Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
        • References
    13. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
        • General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…
        • Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
        • References
        • Keywords
        • Gene Ontology
    14. © 2009 SIB
        • Protein and gene names
        • Taxonomic information
        • Protein sequence: One gene - One species
        • General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…
        • Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
        • References
        • Keywords
        • Gene Ontology
        • Cross-references (~130 databases)
    15. Origin of the sequences in UniProtKB
        • International Nucleotide Sequence Database Collection (INSDC)
        • Ensembl or EnsemblGenomes
        • RefSeq
        • Direct submissions (protein sequences)
        • Literature
        • Protein Data Bank
    16. The process of manual sequence curation
        1. Select entry/gene (priorities)
        2. Identify entries from same gene and homologs using BLAST against UniProtKB
        3. Merge entries from the same gene and same species into a single record
        4. Select a canonical sequence
    17. Critical analysis and report of sequence discrepancies. QPCT_ARATH (Q84WV9), Glutaminyl-peptide cyclotransferase (At4g25720)
    20. Literature-based curation
        • Identify relevant papers through searching literature databases
        • Read full text of papers and extract and summarize relevant information
    24. Controlled vocabularies
        • Keywords provide a summary of the entry content
        • We annotate using the Gene Ontology (GO)
    25. UniProtKB, complete proteome sequence sets
        • Genome completely sequenced
        • Proteins mapped to the genome
        2’902 complete proteomes:
        • Fully manually reviewed (e.g. S. cerevisiae)
        • Partially manually reviewed (e.g. A. thaliana)
        • Unreviewed (e.g. Chlorella variabilis)
    26. UniProtKB, reference proteome sequence sets. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research. 509 reference proteomes.
    27. UniProtKB, complete proteome sequence sets
    28. Arabidopsis thaliana. The building of the complete proteome sequence set:
        • Based on the re-annotation of the complete genome by TAIR: 27’416 protein coding genes
    29. UniProtKB – TAIR synchronization (release 2011_03 - Mar 08, 2011)
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • UniProtKB/TrEMBL: Unreviewed (40’574 entries)
        • UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
    30. UniProtKB – TAIR synchronization
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • Genome re-annotation: 35’386 gene products
        • Temporary TrEMBL set: 33’341 entries
        • UniProtKB/TrEMBL: Unreviewed (40’574 entries)
        • UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
    31. UniProtKB – TAIR synchronization
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • Genome re-annotation: 35’386 gene products
        • Temporary TrEMBL set: 33’341 entries
        • UniProtKB/TrEMBL: Unreviewed (40’574 entries)
        • 11’508 sequences
        • UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
        • Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
    32. UniProtKB – TAIR synchronization
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • Genome re-annotation
        • Temporary TrEMBL set
        • UniProtKB/TrEMBL: Unreviewed
        • UniProtKB/Swiss-Prot: Reviewed
        • Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
        • Feedback to TAIR: 90 gene models
        • Correct gene models or add new isoforms: 283 corrections
    33. UniProtKB – TAIR synchronization
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • Genome re-annotation
        • Temporary TrEMBL set
        • UniProtKB/TrEMBL: Unreviewed
        • Cleaned set of new TrEMBL entries (21’656 entries)
        • UniProtKB/Swiss-Prot: Reviewed
    34. UniProtKB – TAIR synchronization (release 2011_12 - Dec 14, 2011)
        • cDNAs, ESTs, genomic sequences
        • Nucleic acid databases
        • Genome re-annotation
        • Temporary TrEMBL set
        • UniProtKB/TrEMBL: Unreviewed (44’628 entries)
        • UniProtKB/Swiss-Prot: Reviewed (10’875 entries)
        • Cleaned set of new TrEMBL entries (21’656 entries) + UniProtKB/Swiss-Prot Reviewed (10’865 entries)
        • Arabidopsis thaliana, cv. Columbia. Complete proteome: 32’521 entries
    35. 1001 Arabidopsis genomes
        • Deposited to INSDC?
        • Fully annotated? With CDS?
        • Should we still merge all the identical sequences together?
        • If they are not merged but kept separate, how to get relevant BLAST results?
    36. Some UniProtKB/Swiss-Prot statistics concerning plant entries (UniProt release 2011_12 - Dec 14, 2011)
        • 31,959 entries of Viridiplantae
        • from 1,924 species
        • 10,875 entries from Arabidopsis thaliana (with 1,219 isoforms)
        • 2,823 entries from Oryza sativa subsp. japonica
        • 11,897 plant entries with an EC number
        • 966 different complete EC numbers
        • 5,744 putative transporters or proteins involved in transport
    37. Summary
        UniProtKB/Swiss-Prot, the manually curated knowledgebase:
        • Protein sequence database covering all kingdoms of life (533’657 sequence entries; 12’664 species)
        • Manually annotated
        • Non-redundant: all products of one gene in one species in a single entry
        • Highly cross-referenced (links to ~130 databases)
        Plant protein annotation:
        • Complete proteome for Arabidopsis thaliana
        • Synchronization with TAIR
    38. We need your feedback and your collaboration! help@uniprot.org
    39. Acknowledgements
        SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey
        EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
        PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
        www.uniprot.org
    40. UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.
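Steps 3 and 4 of the manual sequence curation process on slide 16 (merge entries from the same gene and same species into a single record, then select a canonical sequence) can be sketched in Python as follows. This is an illustration only, not UniProt's implementation: `merge_entries`, the tuple layout and the toy data are hypothetical, and choosing the longest sequence as canonical is a placeholder assumption, since the slide does not state the selection criteria.

```python
from collections import defaultdict


def merge_entries(entries):
    """Group (gene, species, sequence) tuples into one record per gene and species.

    The canonical sequence is picked with a placeholder rule (longest sequence);
    this rule is an assumption made for the sketch, not the curators' criteria.
    """
    grouped = defaultdict(list)
    for gene, species, sequence in entries:
        sequences = grouped[(gene, species)]
        if sequence not in sequences:  # identical sequences are kept once
            sequences.append(sequence)

    records = {}
    for key, sequences in grouped.items():
        canonical = max(sequences, key=len)
        records[key] = {
            "canonical": canonical,
            "other_sequences": [s for s in sequences if s != canonical],
        }
    return records


# Toy data, invented for illustration only.
entries = [
    ("geneA", "Arabidopsis thaliana", "MKTAYIAKQR"),
    ("geneA", "Arabidopsis thaliana", "MKTAYIAKQRQISFV"),
    ("geneA", "Arabidopsis thaliana", "MKTAYIAKQR"),  # duplicate submission
    ("geneA", "Oryza sativa", "MKTAYLAKQR"),          # same gene name, other species
]

records = merge_entries(entries)
for (gene, species), record in records.items():
    print(gene, species, record["canonical"], record["other_sequences"])
```

Keying the records on the (gene, species) pair mirrors the "one gene, one species, one entry" principle stated in the summary slide. Sequences that differ from the canonical one are kept in the record rather than discarded.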